
Asynchronous Interconnection Network and

Communication

- Chapter 3 of Casanova et al.

Interconnection Network Topologies

- The processors in a distributed memory parallel system are connected using an interconnection network.
- All computers have specialized coprocessors that route messages and place data in local memories.
- Nodes consist of a (computing) processor, a memory, and a communications coprocessor.
- Nodes are often called processors when this is unambiguous.

Network Topology Types

- Static Topologies
- A fixed network that cannot be changed
- Nodes are connected directly to each other by point-to-point communication links
- Dynamic Topologies
- Topology can change at runtime
- One or more nodes can request that direct communication be established between them
- Done using switches

Some Static Topologies

- Fully connected network (or clique)
- Ring
- Two-Dimensional grid
- Torus
- Hypercube
- Fat tree

Examples of Interconnection Topologies

Static Topologies Features

- Fixed number of nodes
- Degree
- Number of edges incident to a node
- Distance between nodes
- Length of the shortest path between two nodes
- Diameter
- Largest distance between two nodes
- Number of links
- Total number of edges
- Bisection Width
- Minimum number of edges that must be removed to partition the network into two disconnected networks of the same size

Classical Interconnection Networks Features

- Clique (or Fully Connected)
- All pairs of processors are directly connected
- p(p-1)/2 edges
- Ring
- Very simple and very useful topology
- 2D Grid
- Degree of interior processors is 4
- Not symmetric, as edge processors have different properties
- Very useful when computations are local and communications are between neighbors
- Has been heavily used previously

Classical Network

- 2D Torus
- Easily formed from a 2D mesh by connecting matching end points
- Hypercube
- Has been extensively used
- Using its recursive definition, one can design simple but very efficient algorithms
- Has a small diameter that is logarithmic in the number of nodes
- Degree and total number of edges grow too quickly to be useful for massively parallel machines


Dynamic Topologies

- Fat tree is different from the other networks included
- The compute nodes are only at the leaves
- Nodes at higher levels do not perform computation
- Topology is a binary tree both in the 2D front view and in the side view
- Provides extra bandwidth near the root
- Used by Thinking Machines Corp. on the CM-5
- Crossbar Switch
- Has p^2 switches, which is very expensive for large p
- Can connect p processors to p processors in any one-to-one combination
- Cost rises with the number of switches, which is quadratic in the number of processors


Dynamic Topologies (cont)

- Benes and Omega Networks
- Use smaller crossbars arranged in stages
- Only crossbars in adjacent stages are connected together
- Called multi-stage networks; cheaper to build than a full crossbar
- Configuring a multi-stage network is more difficult than configuring a crossbar
- Dynamic networks are now the most commonly used topologies

A Simple Communications Performance Model

- Assume a processor Pi sends a message of length m to Pj.
- The cost to transfer the message along a network link is roughly linear in the message length.
- As a result, the cost to transfer the message along a particular route is roughly linear in m.
- Let ci,j(m) denote the time to transfer this message.

Hockney Performance Model for Communications

- The time ci,j(m) to transfer this message can be modeled by
- ci,j(m) = Li,j + m/Bi,j = Li,j + m·bi,j
- m is the size of the message
- Li,j is the startup time, also called latency
- Bi,j is the bandwidth, in bytes per second
- bi,j = 1/Bi,j, the inverse of the bandwidth
- Proposed by Hockney in 1994 to evaluate the performance of the Intel Paragon
- Probably the most commonly used model
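As a quick sketch, the Hockney estimate is a one-line function (the parameter values below are made up for illustration, not taken from the text):

```python
def hockney_time(m, L, B):
    """Hockney model: time to transfer an m-byte message over a link
    with startup latency L (seconds) and bandwidth B (bytes/second)."""
    return L + m / B

# Illustrative values: 10 microsecond latency, 1 GB/s link, 1 MB message.
t = hockney_time(10**6, 10e-6, 1e9)
```

For large m the m/B term dominates; for small m the latency L dominates.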

Hockney Performance Model (cont.)

- Factors that Li,j and Bi,j depend on
- Length of route
- Communication protocol used
- Communications software overhead
- Ability to use links in parallel
- Whether links are half or full duplex
- Etc.

Store and Forward Protocol

- SF is a point-to-point protocol
- Each intermediate node receives and stores the entire message before retransmitting it
- Implemented in the earliest parallel machines, in which nodes did not have communications coprocessors
- Intermediate nodes are interrupted to handle messages and route them toward their destination

Store and Forward Protocol (cont)

- If d(i,j) is the number of links between Pi and Pj, the formula for ci,j(m) can be rewritten as
- ci,j(m) = d(i,j)(L + m/B) = d(i,j)·L + d(i,j)·m·b
- where
- L is the latency of one link and b is the reciprocal of the bandwidth of one link
- This protocol yields poor latency and poor effective bandwidth
- The communication cost can be reduced using pipelining

Store and Forward Protocol using Pipelining

- The message is split into r packets of size m/r.
- The packets are sent one after another from Pi to Pj.
- The first packet reaches Pj after ci,j(m/r) time units.
- The remaining r-1 packets arrive in
- (r-1)(L + m·b/r) time units
- Simplifying, the total communication time reduces to
- (d(i,j) - 1 + r)(L + m·b/r)
- Casanova et al. derive the optimal value for r above.
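Setting the derivative of (d - 1 + r)(L + m·b/r) with respect to r to zero gives r* = sqrt((d-1)·m·b/L). A small sketch with illustrative parameters (not taken from the text):

```python
import math

def pipelined_sf_time(d, L, m, b, r):
    """Store-and-forward with pipelining: r packets of size m/r
    over a route of d links, Hockney parameters L and b = 1/B."""
    return (d - 1 + r) * (L + m * b / r)

def optimal_r(d, L, m, b):
    """Minimizer of the expression above: r* = sqrt((d-1)*m*b / L)."""
    return math.sqrt((d - 1) * m * b / L)

# Illustrative parameters: 4-hop route, 1 MB message.
d, L, m, b = 4, 1e-5, 1e6, 1e-9
r_star = optimal_r(d, L, m, b)
# Pipelining with r_star packets beats sending one big packet (r = 1).
```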

Two Cut-Through Protocols

- Common performance model
- ci,j(m) = L + d(i,j)·δ + m/B
- where
- L is the one-time cost of creating a message
- δ is the routing management overhead
- Generally δ << L, as routing management is performed by hardware while L involves software overhead
- m/B is the time required to transmit the message through the entire route

Circuit-Switching Protocol

- First cut-through protocol
- The route is created before the first message is sent
- The message is sent directly to the destination through this route
- The nodes used in this transmission cannot be used for any other communication while it is in progress

Wormhole (WH) Protocol

- A second cut-through protocol
- The destination address is stored in the header of the message
- Routing is performed dynamically at each node
- The message is split into small packets called flits
- If two flits arrive at the same time, flits are stored in intermediate nodes' internal registers

Point-to-Point Communication Comparisons

- Store and Forward is not used in physical networks, but only at the application level
- Cut-through protocols are more efficient
- Hide the distance between nodes
- Avoid large buffer requirements at intermediate nodes
- Almost no message loss
- For small networks, a flow-control mechanism is not needed
- Wormhole is generally preferred to circuit switching
- Latency is normally much lower
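To see why cut-through "hides the distance," compare the two cost formulas as m grows (illustrative parameters only; delta is the assumed small per-hop routing overhead from the cut-through model):

```python
def store_and_forward(d, L, m, b):
    """d hops, each storing and re-sending the full m-byte message."""
    return d * (L + m * b)

def cut_through(d, L, m, b, delta):
    """Distance contributes only a small per-hop routing overhead delta."""
    return L + d * delta + m * b

# Illustrative parameters: 8 hops, 1 MB message.
sf = store_and_forward(8, 1e-5, 1e6, 1e-9)
ct = cut_through(8, 1e-5, 1e6, 1e-9, 1e-7)
# For large m, store-and-forward costs roughly d times more than cut-through.
```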


LogP Model

- Models based on the LogP model are more precise than the Hockney model
- Involves three components of communication: the sender, the network, and the receiver
- At times, some of these components may be busy while others are not
- Some parameters for LogP
- m is the message size (in bytes)
- w is the size of the packets the message is split into
- L is an upper bound on the latency
- o is the overhead
- Defined to be the time that a node is engaged in the transmission or reception of a packet

LogP Model (cont)

- Parameters for LogP (cont)
- g, or gap, is the minimal time interval between consecutive packet transmissions or receptions
- During this time, a node may not use the communication coprocessor (i.e., the network card)
- 1/g is the communication bandwidth available per node
- P is the number of nodes in the platform
- Cost of sending m bytes with packet size w
- Processor occupation time on sender/receiver
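The cost expressions themselves accompanied a figure that is not transcribed here. As a hedged sketch, the commonly cited LogP estimate (assuming g ≥ o) for sending m bytes as ceil(m/w) packets is 2o + L + (ceil(m/w) - 1)·g; this may differ in detail from the book's exact expressions:

```python
import math

def logp_send_time(m, w, L, o, g):
    """Commonly cited LogP estimate (assumes g >= o): the first packet
    costs o + L + o, and each subsequent packet adds one gap g."""
    k = math.ceil(m / w)
    return 2 * o + L + (k - 1) * g

# Illustrative parameters (not from the text): 4 KB message, 1 KB packets.
t = logp_send_time(m=4096, w=1024, L=5e-6, o=1e-6, g=2e-6)
```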


Other LogP Related Models

- LogP attempts to capture in a few parameters the characteristics of parallel platforms.
- Platforms are fine-tuned and may use different protocols for short and long messages
- LogGP is an extension of LogP in which G captures the bandwidth for long messages
- pLogP is an extension of LogP in which L, o, and g depend on the message size m
- It also separates the sender overhead os from the receiver overhead or

Affine Models

- The use of floor functions in LogP models makes them nonlinear
- This causes many problems in analytic theoretical studies
- It has resulted in the proposal of many fully linear models
- The time that Pi is busy sending a message is expressed as an affine function of the message size
- An affine function of m has the form f(m) = a·m + b, where a and b are constants. If b = 0, then f is a linear function
- Similarly, the time Pj is busy receiving the message is expressed as an affine function of the message size
- We postpone further coverage of affine models for the present

Modeling Concurrent Communications

- Multi-port model
- Assumes that communications are contention-free and do not interfere with each other
- A consequence is that a node may communicate with an unlimited number of nodes without any degradation in performance
- Would require a clique interconnection network to fully support
- May simplify proofs that certain problems are hard
- If hard under ideal communication conditions, then hard in general
- Assumption is not realistic - communication resources are always limited
- See the Casanova text for additional information

Concurrent Communications Models (2/5)

- Bounded Multi-port model
- Proposed by Hong and Prasanna
- For applications that use threads (e.g., on multi-core technology), the network link can be shared by several incoming and outgoing communications
- The sum of the bandwidths allocated by the operating system to all communications cannot exceed the bandwidth of the network card
- An unbounded number of communications can take place if they share the total available bandwidth
- The application defines the bandwidth allotted to each communication
- Bandwidth sharing by the application is unusual, as it is usually handled by the operating system

Concurrent Communications Models (3/5)

- 1-port (unidirectional or half-duplex) model
- Avoids unrealistically optimistic assumptions
- Forbids concurrent communication at a node
- A node can either send data or receive it, but not both simultaneously
- This model is very pessimistic, as real-world platforms can achieve some concurrent communication
- The model is simple, and it is easy to design algorithms that follow it

Concurrent Communications Models (4/5)

- 1-port (bidirectional or full-duplex) model
- Currently, most network cards are full-duplex.
- Allows a single emission and a single reception simultaneously.
- Introduced by Bhat et al.
- Current hardware does not easily enable multiple messages to be transmitted simultaneously.
- Multiple sends and receives are claimed to be eventually serialized by the single hardware port to the network.
- Saif and Parashar did experimental work suggesting that asynchronous sends become serialized when message sizes exceed a few megabytes.

Concurrent Communications Models (5/5)

- k-ports model
- A node may have k > 1 network cards
- This model allows a node to be involved in a maximum of one emission and one reception on each network card
- This model is used in Chapters 4 and 5

Bandwidth Sharing

- The previous concurrent communication models only consider contention on nodes
- Other parts of the network can also limit performance
- It may be useful to determine constraints on each network link
- This type of network model is useful for performance evaluation purposes, but too complicated for algorithm design purposes
- The Casanova text evaluates algorithms using two models
- Hockney model, or even simplified versions (e.g., assuming no latency)
- Multi-port (ignoring contention) or the 1-port model

Case Study: Unidirectional Ring

- We first consider a platform of p processors arranged in a unidirectional ring.
- Processors are denoted Pk, for k = 0, 1, …, p-1.
- Each PE can find its logical index by calling My_Num().

Unidirectional Ring Basics

- A processor can determine the number of PEs by calling NumProcs()
- Both preceding functions are supported in MPI, a communication library implemented on most asynchronous systems
- Each processor has its own memory
- All processors execute the same program, which acts on data in their local memories
- Single Program, Multiple Data, or SPMD
- Processors communicate by message passing
- explicitly sending and receiving messages

Unidirectional Ring Basics (cont 2/5)

- A processor sends a message using the function
- send(addr, m)
- addr is the memory address (in the sending processor) of the first data item to be sent
- m is the message length (i.e., the number of items to be sent)
- A processor receives a message using the function
- receive(addr, m)
- addr is the local address in the receiving processor where the first data item is to be stored
- If processor Pi executes a receive, then its predecessor P(i-1) mod p must execute a send
- Since each processor has a unique predecessor and a unique successor, they do not have to be specified
Unidirectional Ring Basics (cont 3/5)

- A restrictive assumption is that both the send and the receive are blocking.
- Then the participating processors cannot continue until the communication is complete.
- The blocking assumption is typical of first-generation platforms
- A classical assumption is to keep the receive blocking but to allow the send to be non-blocking
- The processor executing a send can then continue while the data transfer takes place.
- To implement this, one function is used to initiate the send and another to determine when the communication has finished.

Unidirectional Ring Basics (cont 4/5)

- In algorithms, we simply indicate the blocking and non-blocking operations
- A more recently proposed assumption is that a single processor can send data, receive data, and compute simultaneously.
- All three can occur only if no race condition exists.
- It is convenient to think of three logical threads of control running on a processor
- One for computing
- One for sending data
- One for receiving data
- We will usually use the less restrictive third assumption

Unidirectional Ring Basics (cont 5/5)

- Timings for Send/Receive
- We use a simplified version of the Hockney model
- The time to send or receive a message of length m over one link is
- c(m) = L + m·b
- m is the length of the message
- L is the startup cost in seconds, due to the physical latency and the software overhead
- b is the inverse of the data transfer rate

The Broadcast Operation

- The broadcast operation allows a processor Pk to send the same message of length m to all other processors.
- At the beginning of the broadcast, the message is stored at address addr in the memory of the sending processor Pk.
- At the end of the broadcast, the message will be stored at address addr in the memory of all processors.
- All processors must call the following function
- Broadcast(k, addr, m)

Broadcast Algorithm Overview

- The message goes around the ring from processor to processor: from Pk to Pk+1 to Pk+2 … to Pk-1.
- We assume processor numbers are taken modulo p, where p is the number of processors. For example, if k = 0 and p = 8, then k-1 = p-1 = 7.
- Note there is no parallelism in this algorithm, since the message advances around the ring by only one processor per step.
- The predecessor of Pk (i.e., Pk-1) does not send the message back to Pk.


Analysis of Broadcast Algorithm

- For algorithm to be correct, the receive in

Step 10 will execute before Step 11. - Running Time
- Since we have a sequence of p-1 communications,

the time to broadcast a message of length m is - (p-1)(Lmb)
- MPI does not typically use ring topology for

creating communication primitives - Instead use various tree topologies that are more

efficient on modern parallel computer platforms. - However, these primitives are simpler on a ring.
- Prepares readers to implement primitives, when

more efficient than using MPI primitives.
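The algorithm figure itself is on a slide that is not transcribed; as a hedged sketch, the following simulation mirrors only the (p-1)-step hop structure of the ring broadcast:

```python
def ring_broadcast(p, k):
    """Simulate a unidirectional-ring broadcast from processor k.
    The message hops from k to k+1 to ... to k-1 (indices mod p).
    Returns the number of communication steps."""
    steps, holder = 0, k
    received = {k}
    while len(received) < p:
        holder = (holder + 1) % p   # successor receives and forwards
        received.add(holder)
        steps += 1
    return steps

# p-1 sequential hops, so the time is (p-1)*(L + m*b) under the Hockney model.
```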

Scatter Algorithm

- The scatter operation allows Pk to send a different message of length m to each processor.
- Initially, Pk holds the message of length m to be sent to Pq at location addr[q].
- To keep the array of addresses uniform, space for a message from Pk to itself is also provided.
- At the end of the algorithm, each processor stores its message from Pk at location msg.
- The efficient way to implement this algorithm is to pipeline the messages.
- The message to the most distant processor (i.e., Pk-1) is sent first, followed by the message to processor Pk-2, and so on.


Discussion of Scatter Algorithm

- In Steps 5-6, Pk successively sends messages to the other p-1 processors, in order of their distance from Pk.
- In Step 7, Pk stores its message to itself.
- The other processors concurrently move messages along as they arrive, in Steps 9-12.
- Each processor uses two buffers, with addresses tempS and tempR.
- This allows a processor to send one message and receive the next message in parallel in Step 12.

Discussion of Scatter Algorithm (cont)

- In Step 11, tempS ↔ tempR means the two addresses are swapped, so the received value can be sent on to the next processor.
- When a processor receives its own message from Pk, it stops forwarding (Step 10).
- Whatever is in the receive buffer tempR at the end is stored as its message from Pk (Step 13).
- The running time of the scatter algorithm is the same as for the broadcast, namely
- (p-1)(L + m·b)

Example for Scatter Algorithm

- Example: In Figure 3.7, let p = 6 and k = 4.
- Steps 5-6: For i = 1 to p-1 do
- send(addr[(k+p-i) mod p], m)
- Let PE = (k+p-i) mod p = (10 - i) mod 6
- For i=1, PE = 9 mod 6 = 3
- For i=2, PE = 8 mod 6 = 2
- For i=3, PE = 7 mod 6 = 1
- For i=4, PE = 6 mod 6 = 0
- For i=5, PE = 5 mod 6 = 5
- Note messages are sent to processors in the order 3, 2, 1, 0, 5
- That is, messages to the most distant processors are sent first.
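The ordering above can be checked with a short script that applies the same (k+p-i) mod p rule:

```python
def scatter_order(p, k):
    """Destinations in the order Pk sends them: the most distant
    processor on the unidirectional ring comes first."""
    return [(k + p - i) % p for i in range(1, p)]

order = scatter_order(6, 4)  # matches the worked example: [3, 2, 1, 0, 5]
```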

Example for Scatter Algorithm (cont)

- Example: In Figure 3.7, let p = 6 and k = 4.
- Step 10: For i = 1 to (k-1-q) mod p do
- Compute (k-1-q) mod p = (3-q) mod 6 for all q.
- Note q ≠ k, which is 4
- q = 5 → i = 1 to 4, since (3-5) mod 6 = 4
- PE 5 forwards values in the loop from i = 1 to 4
- q = 0 → i = 1 to 3, since (3-0) mod 6 = 3
- PE 0 forwards values from i = 1 to 3
- q = 1 → i = 1 to 2, since (3-1) mod 6 = 2
- PE 1 forwards values from i = 1 to 2
- q = 2 → i = 1 to 1, since (3-2) mod 6 = 1
- PE 2 is active in the loop only when i = 1

Example for Scatter Algorithm (cont)

- q = 3 → i = 1 to 0, since (3-3) mod 6 = 0
- PE 3 precedes PE k, so it never forwards a value
- However, it receives and stores a message in Step 9
- Note that in Step 9, all processors store the first message they receive.
- That means even processor k-1 receives a value to store.

All-to-All Algorithm

- This operation allows all p processors to simultaneously broadcast a message (to all PEs)
- Again, it is assumed all messages have length m
- At the beginning, each processor holds the message it wishes to broadcast at address my_message.
- At the end, each processor holds an array addr of p messages, where addr[k] holds the message from Pk.
- Using pipelining, the running time is the same as for a single broadcast, namely
- (p-1)(L + m·b)


Gossip Algorithm

- Last of the classical collection of communication operations
- Each processor sends a different message to each processor.
- The gossip algorithm is Problem 3.7 in the textbook.
- Note it takes 1 step for each PE to send a message to its closest neighbor, using all links.
- It takes 2 steps for each PE to send a message to its 2nd closest neighbor, using all links.
- In general, it takes 1 + 2 + … + (p-1) = p(p-1)/2 steps for each PE to send messages to all other nodes, using all links of the network during each step.
- Complexity is 1 + 2 + … + (p-1) = p(p-1)/2 = O(p^2).

Pipelined Broadcast by kth processor

- Longer messages can be broadcast faster if they are broken into smaller pieces
- Suppose the message is broken into r pieces of the same length.
- The sender sends the pieces out in order, so that they travel simultaneously on the ring.
- Initially, the pieces are stored at addresses addr[0], addr[1], …, addr[r-1].
- At the end, all pieces are stored on all processors.
- At each step, when a processor receives a piece, it also forwards the piece it previously received, if any, to its successor.


Pipelined Broadcast (cont)

- It takes p-1 communication steps for the first piece to reach the last processor, Pk-1.
- It then takes r-1 more steps for the remaining pieces to reach Pk-1.
- The required time is (p + r - 2)(L + m·b/r)
- The value of r that minimizes this expression can be found by setting its derivative (with respect to r) to zero and solving for r.
- For large m, the time required tends to m·b
- It does not depend on p.
- This compares well to the plain broadcast time (p-1)(L + m·b)
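Setting the derivative of (p + r - 2)(L + m·b/r) to zero gives r* = sqrt((p-2)·m·b/L). A numeric sketch with illustrative parameters (not from the text) shows the large-m time approaching m·b:

```python
import math

def pipelined_bcast_time(p, r, L, m, b):
    """Ring broadcast with the message split into r pieces:
    (p + r - 2) steps, each costing L + (m/r)*b."""
    return (p + r - 2) * (L + m * b / r)

# Illustrative parameters: 16 processors, 100 MB message.
p, L, m, b = 16, 1e-5, 1e8, 1e-9
r_star = math.sqrt((p - 2) * m * b / L)       # minimizer of the expression
t = pipelined_bcast_time(p, round(r_star), L, m, b)
# For large m, t approaches m*b = 0.1 s, independent of p, far below
# the plain broadcast time (p-1)*(L + m*b).
```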

Hypercube

- Defn: A 0-cube consists of a single vertex. For n > 0, an n-cube consists of two identical (n-1)-cubes with edges added to join matching pairs of vertices in the two (n-1)-cubes.

Hypercubes (cont)

- Equivalent defn: An n-cube is a graph consisting of 2^n vertices numbered from 0 to 2^n - 1, such that two vertices are connected if and only if their binary representations differ in exactly one bit.
- Property: The diameter and degree of an n-cube are both equal to n.
- The proof is left for the reader. It is easy using recursion.
- Hamming Distance: Let A and B be two points in an n-cube. H(A,B) is the number of bits that differ between the binary labels of A and B.
- Notation: If b is a binary bit, then let b̄ = 1-b, the bit complement of b.
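The Hamming distance, and hence the distance between two hypercube nodes, is easy to compute from the XOR of the labels; a small sketch:

```python
def hamming(a, b):
    """Number of bit positions in which a and b differ: this is the
    distance between nodes a and b in the hypercube."""
    return bin(a ^ b).count("1")

# Neighbors in the n-cube are exactly the pairs at Hamming distance 1.
```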

Hypercube Paths

- Using binary representation, let
- A = an-1 an-2 … a2 a1 a0
- B = bn-1 bn-2 … b2 b1 b0
- WLOG, assume that A and B differ exactly in their last k bits.
- Placing the differing bits at the end makes numbering them easier
- Then a path from A to B can be created by the following sequence of nodes (ā denotes the complement of bit a)
- A = an-1 an-2 … a2 a1 a0
- Vertex 1 = an-1 an-2 … a2 a1 ā0
- Vertex 2 = an-1 an-2 … a2 ā1 ā0
- ........
- B = Vertex k = an-1 an-2 … ak āk-1 … ā2 ā1 ā0

Hypercube Paths (cont)

- Independent of which bits of A and B differ, there are k choices for the first bit to flip, k-1 choices for the next bit to flip, etc.
- This gives k! different paths from A to B.
- How many independent paths exist from A to B?
- I.e., paths with only A and B as common vertices.
- Theorem: If A and B are n-cube vertices that differ in k bits, then there exist exactly k independent paths from A to B.
- Proof: First, we show k independent paths exist.
- We build an independent path for each j with 0 ≤ j < k

Hypercube Paths (cont)

- Let P(j, j-1, j-2, …, 0, k-1, k-2, …, j+1) denote the path from A to B that flips the bits in that order, giving the following sequence of nodes (ā denotes the complement of bit a)
- A = an-1 … ak ak-1 … aj+1 aj aj-1 … a1 a0
- V(1) = an-1 … ak ak-1 … aj+1 āj aj-1 … a1 a0
- V(2) = an-1 … ak ak-1 … aj+1 āj āj-1 aj-2 … a1 a0
- .........
- V(j+1) = an-1 … ak ak-1 … aj+1 āj āj-1 … ā1 ā0
- V(j+2) = an-1 … ak āk-1 ak-2 … aj+1 āj āj-1 … ā1 ā0
- .............
- V(k-1) = an-1 … ak āk-1 … āj+2 aj+1 āj āj-1 … ā1 ā0
- B = V(k) = an-1 … ak āk-1 … āj+2 āj+1 āj āj-1 … ā1 ā0

Hypercube Paths (cont)

- Suppose the following two paths have a common vertex X other than A and B
- P(j, j-1, j-2, …, 0, k-1, k-2, …, j+1)
- P(t, t-1, t-2, …, 0, k-1, k-2, …, t+1)
- Since the paths are different, and A and B differ in k bits, we may assume 0 ≤ t < j < k
- Let A and X differ in q bits
- To travel from A to X along either path, exactly q bits have been flipped, taken in the circular order of each path
- 1st path: j, j-1, …, t, t-1, …, 0, k-1, k-2, …
- 2nd path: t, t-1, …, 0, k-1, k-2, …
- This is impossible, as the first q bits flipped along the two paths can never be exactly the same set of bits

Hypercube Paths (cont)

- Finally, there cannot be another independent path Q from A to B (i.e., one with no interior vertex in common with the k paths above)
- If so, the first node in path Q following A would have to flip one bit, say bit q, toward agreement with B.
- But then, the path described earlier that flips bit q first would have a common interior vertex with this path Q.

Hypercube Routing

- XOR is the exclusive OR: a result bit is 1 exactly when the corresponding input bits differ.
- To design a route from A to B in the n-cube, we use the algorithm that always flips the rightmost bit that disagrees with B.
- The XOR of the binary representations of A and B indicates by 1s the bits that have to be flipped.
- For the 5-cube, if A is 10111 and B is 01110, then
- A XOR B is 11001, and the routing is as follows
- A = 10111 → 10110 → 11110 → 01110 = B
- This algorithm can be executed as follows
- A XOR B = 10111 XOR 01110 = 11001, so A routes the message along link 1 (the rightmost digit's link) to node A1 = 10110
- A1 XOR B = 10110 XOR 01110 = 11000, so node A1 routes the message along link 4 to node A2 = 11110
- A2 XOR B = 11110 XOR 01110 = 10000, so A2 routes the message along link 5 to B
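A minimal sketch of this rightmost-differing-bit routing (here bit 0 is the rightmost bit, whereas the slide numbers links from 1):

```python
def route(a, b):
    """Route from node a to node b in a hypercube by repeatedly
    flipping the rightmost bit in which the current node differs
    from b; the 1-bits of a XOR b mark the bits still to flip."""
    path = [a]
    while a != b:
        diff = a ^ b
        rightmost = diff & -diff    # lowest set bit of the XOR
        a ^= rightmost              # flip that bit
        path.append(a)
    return path

# Reproduces the slide's example route in the 5-cube:
r = route(0b10111, 0b01110)  # 10111 -> 10110 -> 11110 -> 01110
```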

Hypercube Routing (cont)

- This routing algorithm can be used to implement a wormhole or cut-through protocol in hardware.
- Problem: If another pair of processors has already reserved a link on the desired path, the message may stall until the end of the other communication.
- Solution: Since there are multiple paths, the routers select which links to use based on a link reservation table and message labels.
- In our example, if link 1 at node A is busy, then A can instead use link 4 to forward the message to node 11111, which lies on another path to B.
- If at some point the current vertex determines that no useful link is available, then it has to wait for a useful link to become available
- Alternatively, if the desired links are not available at some vertex, the algorithm could use a link that extends the path length

Gray Code

- Recursive construction of the Gray code
- G1 = (0, 1) and has 2^1 = 2 elements
- G2 = (00, 01, 11, 10) and has 2^2 = 4 elements
- G3 = (000, 001, 011, 010, 110, 111, 101, 100) has 2^3 = 8 elements
- etc.
- The Gray code for dimension n ≥ 1 is denoted Gn and is defined recursively: G1 = (0, 1), and for n > 1, Gn is the sequence 0Gn-1 followed by the sequence 1Gn-1rev, where
- xGn-1 is the sequence obtained by prefixing every element of Gn-1 with x
- Grev is the sequence obtained by listing the elements of G in reverse order

Gray Code (cont.)

- Since Gn-1 has 2^(n-1) elements and Gn consists of exactly two copies of Gn-1, Gn has 2^n elements.
- Summary: The Gray code Gn is an ordered sequence of all 2^n binary codes with n digits, whose successive values differ from each other in exactly one bit.
- Notation: Let gi(r) denote the ith element of the Gray code of dimension r.
- Observation: The Gray code Gn = g1(n), g2(n), …, g2^n(n) forms an ordered sequence of names for all of the nodes in a 2^n ring.

Embeddings

- Defn: An embedding of a topology (e.g., ring, 2D mesh, etc.) into an n-cube is a 1-1 function f from the vertices of the topology into the vertices of the n-cube.
- An embedding is said to preserve locality if the images of any two neighbors are also neighbors in the n-cube
- If an embedding does not preserve locality, then we try to minimize the distance in the hypercube between the images of neighbors.
- An embedding is said to be onto if the range of the embedding function f is the entire n-cube.

A 2^n Ring Embedding onto the n-cube

- Theorem: There is an embedding of a ring with 2^n vertices onto an n-cube that preserves locality.
- Proof:
- Our construction of the Gray code provides an ordered sequence of binary values with n digits that can be used as the names of the nodes of the ring with 2^n vertices
- The first name (e.g., 000) can be assigned to any node of the ring.
- The remaining Gray code names are then given to the ring nodes successively, in clockwise or counter-clockwise order

Embedding Ring onto n-cube (cont)

- The Gray code binary numbers are identical to the names assigned to the hypercube nodes.
- Two successive names differ in only one binary digit, so the corresponding n-cube nodes are connected.
- So the embedding results from the Gray code providing an ordered sequence of n-cube names that is used to name the 2^n ring nodes successively.
- This concludes the argument that the embedding follows from the construction of the Gray code.
- However, the following formal proof provides more details.

A 2^n Ring Embedding onto the n-cube

- The following optional formal proof is included for those who do not find the preceding argument convincing
- Theorem: There is an embedding of a ring with 2^n vertices onto an n-cube that preserves locality.
- Proof:
- We establish the following claim: The mapping f(i) = gi(n) is an embedding of the ring onto the n-cube.
- This claim is true for n = 1, as G1 = (0, 1) and nodes 0 and 1 are connected on both the ring and the 1-cube.
- We assume the claim is true for a fixed n-1 with n > 1.
- We use the neighbor-preserving embedding f' of the vertices of a ring with 2^(n-1) nodes onto the (n-1)-cube to build a similar embedding f of a ring with 2^n vertices onto the n-cube.

Ring Embedding for n-cube (cont)

- Recall that the Gray code Gn is the sequence 0Gn-1 followed by the sequence 1Gn-1rev
- The n-cube consists of one copy of an (n-1)-cube with a 0 prefix and a second copy of an (n-1)-cube with a 1 prefix.
- By the induction assumption for n-1, the Gray code sequence 0Gn-1 provides the binary codes for a ring of elements in the first (n-1)-cube, with each successive element differing by one digit from its predecessor.
- Likewise, the Gray code sequence 1Gn-1rev provides the binary codes for a ring of elements in the second copy of the (n-1)-cube, with each successive element differing by one digit from its predecessor.
- The last element of 0Gn-1 is identical to the first element of 1Gn-1rev except for the added digit, so these two elements also differ by one bit.
- Similarly, the first element of 0Gn-1 and the last element of 1Gn-1rev differ only in the added digit, which closes the ring.
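The locality-preserving property is easy to check mechanically: successive Gray codes, including the wraparound pair, differ in exactly one bit. A sketch using the standard closed form i ^ (i >> 1) for the binary-reflected Gray code:

```python
def gray_ints(n):
    """Binary-reflected Gray code as integers, via the standard
    closed form g(i) = i XOR (i >> 1)."""
    return [i ^ (i >> 1) for i in range(2 ** n)]

def is_ring_embedding(n):
    """Check that consecutive codes (with wraparound) differ in exactly
    one bit, i.e. the 2**n ring maps onto the n-cube preserving locality."""
    g = gray_ints(n)
    return all(bin(g[i] ^ g[(i + 1) % len(g)]).count("1") == 1
               for i in range(len(g)))
```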

2D Torus Embedding for n-cube

- We embed a 2^r × 2^s torus onto an n-cube with n = r+s by using the cartesian product Gr × Gs of two Gray codes.
- A processor with coordinates (i,j) on the grid is mapped to the processor f(i,j) = (gi(r), gj(s)) in the n-cube.
- Recall that the map f1(i) = gi(r) is an embedding of a 2^r ring onto an r-cube (a row ring), and f2(j) = gj(s) is an embedding of a 2^s ring onto an s-cube (a column ring).
- We identify (gi(r), gj(s)) with the node of the n = r+s cube whose first r bits are given by gi(r) and whose next s bits are given by gj(s).
- Then, for fixed j, the nodes f(i±1, j) are neighbors of f(i,j), since f1 is an embedding of a 2^r ring onto an r-cube.
- Likewise, for fixed i, the nodes f(i, j±1) are neighbors of f(i,j), since f2 is an embedding of a 2^s ring onto an s-cube.

Collective Communications in Hypercube

- Purpose: Gain an overview of the complexity of collective communications for the hypercube.
- We will focus on broadcast on the hypercube.
- Assume processor 0 wants to broadcast, and consider the naïve algorithm:
- Processor 0 sends the message to all of its neighbors
- Next, every neighbor sends the message to all of its neighbors.
- Etc.
- Redundancy in the naïve algorithm
- The same processor receives the same message many times.
- E.g., processor 0 receives the message back from all its neighbors.
- Mismatched SENDs and RECEIVEs may happen.

Improved Hypercube Broadcast

- We seek a strategy in which
- Each processor receives the message only once
- The number of steps is minimal
- We will use one or more spanning trees.
- Send and Receive need a parameter specifying on which cube dimension the communication takes place
- SEND(cube_link, send_addr, m)
- RECEIVE(cube_link, recv_addr, m)

Hypercube Broadcast Algorithm

- There are n steps, numbered from n-1 down to 0
- Each processor receives its message on the link corresponding to its rightmost 1
- A processor that has received the message forwards it on the links whose indices are smaller than the position of its rightmost 1
- At step i, every processor whose rightmost 1 is strictly larger than i forwards the message on link i
- Let the broadcast originate with processor 0
- Assume processor 0 has a fictitious 1 at position n
- This adds an additional digit, as a processor label has binary digits for positions 0, 1, …, n-1

Trace of Hypercube Broadcast for n = 4

- Since the broadcast originates with processor 0, we treat its index as 10000
- Processor 0 sends the broadcast message on link 3 to 1000
- Next, both processors 0000 and 1000 have their rightmost 1 in position at least 3, so both send the message along link 2
- The messages go to 0100 and 1100, respectively.
- This process continues until the last step, at which every even-numbered processor sends the message along its link 0.

Broadcasting using Spanning Tree

Broadcast Algorithm

- Let BIT(A, b) denote the value of the bth bit of processor A.
- The algorithm for the broadcast of a message of length m by processor k is given in Algorithm 3.5
- Since there are n steps, the execution time under the store-and-forward model is n(L + m·b).
- This algorithm is valid for the 1-port model.
- At each step, a processor communicates with at most one other processor.
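A hedged simulation of the rightmost-1 scheme for a broadcast from processor 0 (treating processor 0 as having a fictitious 1 at position n; this mirrors the step rule, not the book's Algorithm 3.5 verbatim):

```python
def rightmost_one(x, n):
    """Position of the rightmost 1 bit of x; processor 0 is treated
    as having a fictitious 1 at position n."""
    if x == 0:
        return n
    return (x & -x).bit_length() - 1

def hypercube_broadcast(n):
    """Simulate the n-step broadcast from processor 0: at step i
    (i = n-1 down to 0), every current holder whose rightmost 1 is
    strictly larger than i sends on link i. Returns, for each node,
    how many times it received the message."""
    received = {0: 0}                 # node -> number of receptions
    for i in range(n - 1, -1, -1):
        for node in list(received):   # snapshot: only current holders send
            if rightmost_one(node, n) > i:
                dest = node | (1 << i)    # neighbor across dimension i
                received[dest] = received.get(dest, 0) + 1
    return received
```

Running it for n = 4 shows all 16 nodes covered, each non-root node receiving the message exactly once, as the spanning-tree strategy requires.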


Broadcast Algorithm in Hypercube (cont)

- Observations about the Algorithm's Steps
- The algorithm's actions are specified for processor q
- n is the number of binary digits used to label the processors.
- pos = q XOR k
- uses the exclusive OR.
- The broadcast is from Pk.
- This relabeling updates pos so that the algorithm works as if P0 were the root of the broadcast
- Steps 5-7 set first-1 to the location of the first 1 in pos.
- Note: Steps 8-10 are the core of the algorithm
- phase steps through the link dimensions, higher ones first
- If the link dimension equals first-1, q receives the message on its first-1 link
- q then sends the message along each dimension smaller than first-1 to the processor below it.
