Asynchronous Interconnection Network and Communication

- Chapter 3 of Casanova et al.

Interconnection Network Topologies

- The processors in a distributed-memory parallel system are connected using an interconnection network.
- All computers have specialized coprocessors that route messages and place data in local memories.
- Nodes consist of a (computing) processor, a memory, and a communications coprocessor.
- Nodes are often called processors when this is not ambiguous.

Network Topology Types

- Static Topologies
  - A fixed network that cannot be changed
  - Nodes are connected directly to each other by point-to-point communication links
- Dynamic Topologies
  - Topology can change at runtime
  - One or more nodes can request that direct communication be established between them
  - Done using switches

Some Static Topologies

- Fully connected network (or clique)
- Ring
- Two-Dimensional grid
- Torus
- Hypercube
- Fat tree

Examples of Interconnection Topologies

Static Topologies Features

- Fixed number of nodes
- Degree
  - Number of edges incident to a node
- Distance between nodes
  - Length of the shortest path between two nodes
- Diameter
  - Largest distance between any two nodes
- Number of links
  - Total number of edges
- Bisection width
  - Minimum number of edges that must be removed to partition the network into two disconnected networks of the same size
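
The features above can be checked numerically on small networks. The following is a minimal sketch, not taken from the textbook: the helper names (`ring`, `hypercube`, `diameter`, and so on) are hypothetical, the ring is assumed to be bidirectional, and the diameter is found by breadth-first search from every node.

```python
from collections import deque

def ring(p):
    """Adjacency sets of a bidirectional ring with p nodes (assumes p >= 3)."""
    return {v: {(v - 1) % p, (v + 1) % p} for v in range(p)}

def hypercube(n):
    """Adjacency sets of an n-cube: neighbors differ in exactly one bit."""
    return {v: {v ^ (1 << b) for b in range(n)} for v in range(2 ** n)}

def distances_from(graph, src):
    """Breadth-first search: shortest path length from src to every node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def diameter(graph):
    """Largest distance between any two nodes."""
    return max(max(distances_from(graph, v).values()) for v in graph)

def max_degree(graph):
    """Largest number of edges incident to a single node."""
    return max(len(nbrs) for nbrs in graph.values())

def num_links(graph):
    """Total number of edges (each edge is stored at both endpoints)."""
    return sum(len(nbrs) for nbrs in graph.values()) // 2
```

For example, `diameter(hypercube(4))` and `max_degree(hypercube(4))` both evaluate to 4, matching the property that the diameter and degree of an n-cube equal n.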

Classical Interconnection Networks Features

- Clique (or fully connected network)
  - All pairs of processors are directly connected
  - p(p-1)/2 edges
- Ring
  - Very simple and very useful topology
- 2D Grid
  - Degree of interior processors is 4
  - Not symmetric, as edge processors have different properties
  - Very useful when computations are local and communications are between neighbors
  - Has been heavily used in the past

Classical Network

- 2D Torus
  - Easily formed from a 2D mesh by connecting matching end points
- Hypercube
  - Has been extensively used
  - Using its recursive definition, one can design simple but very efficient algorithms
  - Has a small diameter that is logarithmic in the number of nodes
  - Degree and total number of edges grow too quickly to be useful with massively parallel machines

Dynamic Topologies

- The fat tree differs from the other networks covered
  - The compute nodes are only at the leaves
  - Nodes at higher levels do not perform computation
  - Topology is a binary tree both in the 2D front view and in the side view
  - Provides extra bandwidth near the root
  - Used by Thinking Machines Corp. on the CM-5
- Crossbar Switch
  - Has $p^2$ switches, which is very expensive for large p
  - Can connect n processors to any combination of n processors
  - Cost rises with the number of switches, which is quadratic in the number of processors

Dynamic Topologies (cont)

- Benes networks and Omega networks
  - Use smaller crossbars arranged in stages
  - Only crossbars in adjacent stages are connected together
  - Called multi-stage networks; they are cheaper to build than a full crossbar
  - Configuring multi-stage networks is more difficult than configuring a crossbar
- Dynamic networks are now the most commonly used topologies

A Simple Communications Performance Model

- Assume a processor $P_i$ sends a message of length m to $P_j$.
- The cost to transfer a message along a network link is roughly linear in the message length.
- As a result, the cost to transfer a message along a particular route is roughly linear in m.
- Let $c_{i,j}(m)$ denote the time to transfer this message.

Hockney Performance Model for Communications

- The time $c_{i,j}(m)$ to transfer this message can be modeled by
  - $c_{i,j}(m) = L_{i,j} + m/B_{i,j} = L_{i,j} + m\,b_{i,j}$
  - m is the size of the message
  - $L_{i,j}$ is the startup time, also called latency
  - $B_{i,j}$ is the bandwidth, in bytes per second
  - $b_{i,j} = 1/B_{i,j}$ is the inverse of the bandwidth
- Proposed by Hockney in 1994 to evaluate the performance of the Intel Paragon.
- Probably the most commonly used model.
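
As an illustration of the Hockney model, the sketch below evaluates the transfer time as latency plus message size divided by bandwidth. The function names and the sample values (50 microseconds of latency, 1 GB/s of bandwidth) are assumptions chosen for the example, not measurements from any platform.

```python
def hockney_time(m, latency, bandwidth):
    """Transfer time for m bytes: startup latency plus m / bandwidth (seconds)."""
    return latency + m / bandwidth

def effective_bandwidth(m, latency, bandwidth):
    """Bytes per second actually achieved once the latency is accounted for."""
    return m / hockney_time(m, latency, bandwidth)

# Hypothetical link: 50 microseconds of latency, 1e9 bytes per second.
small = effective_bandwidth(1_000, 50e-6, 1e9)        # latency dominates
large = effective_bandwidth(100_000_000, 50e-6, 1e9)  # bandwidth dominates
```

Under these assumed values, short messages achieve only a small fraction of the nominal bandwidth, while long messages approach it, which is why latency and bandwidth are modeled separately.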

Hockney Performance Model (cont.)

- Factors that $L_{i,j}$ and $B_{i,j}$ depend on
- Length of route
- Communication protocol used
- Communications software overhead
- Ability to use links in parallel
- Whether links are half or full duplex
- Etc.

Store and Forward Protocol

- Store and forward (SF) is a point-to-point protocol
- Each intermediate node receives and stores the entire message before retransmitting it
- Implemented in the earliest parallel machines, in which nodes did not have communications coprocessors
- Intermediate nodes are interrupted to handle messages and route them towards their destination

Store and Forward Protocol (cont)

- If d(i,j) is the number of links between $P_i$ and $P_j$, the formula for $c_{i,j}(m)$ can be rewritten as
  - $c_{i,j}(m) = d(i,j)\,(L + m/B) = d(i,j)\,L + d(i,j)\,m\,b$
  - where L is the initial latency and b is the reciprocal of the bandwidth for one link.
- This protocol yields both poor latency and poor bandwidth.
- The communication cost can be reduced using pipelining.

Store and Forward Protocol using Pipelining

- The message is split into r packets of size m/r.
- The packets are sent one after another from $P_i$ to $P_j$.
- The first packet reaches node j after $c_{i,j}(m/r) = d(i,j)\,(L + mb/r)$ time units.
- The remaining r-1 packets arrive in
  - $(r-1)\,(L + mb/r)$ time units
- Simplifying, the total communication time reduces to
  - $(d(i,j) - 1 + r)\,(L + mb/r)$
- Casanova et al. derive the optimal value of r for this expression.
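
The optimal packet count can be sketched as follows. Expanding the pipelined time gives (d-1)L + (d-1)mb/r + rL + mb; setting the derivative with respect to r to zero yields r = sqrt((d-1)mb/L). The code below is an illustrative sketch with hypothetical function names; because the expression is convex in r, it only compares the two integers adjacent to the real-valued optimum.

```python
import math

def sf_time(d, m, L, b):
    """Store-and-forward without pipelining: d hops, each costing L + m*b."""
    return d * (L + m * b)

def sf_pipelined_time(d, m, L, b, r):
    """Store-and-forward with the message split into r packets of size m/r."""
    return (d - 1 + r) * (L + m * b / r)

def best_packet_count(d, m, L, b):
    """Integer r >= 1 that minimizes sf_pipelined_time."""
    r_real = math.sqrt((d - 1) * m * b / L)   # zero of the derivative
    candidates = {max(1, math.floor(r_real)), max(1, math.ceil(r_real))}
    return min(candidates, key=lambda r: sf_pipelined_time(d, m, L, b, r))
```

With assumed values d = 5, m = 10^6 bytes, L = 10^-4 s and b = 10^-9 s/byte, the real-valued optimum is sqrt(40), about 6.3, so the function picks 6 or 7 packets.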

Two Cut-Through Protocols

- Common performance model
  - $c_{i,j}(m) = L + d(i,j)\,\delta + m/B$
  - where
  - L is the one-time cost of creating a message
  - $\delta$ is the routing management overhead
  - Generally $\delta \ll L$, as routing management is performed by hardware while L involves software overhead
  - m/B is the time required to transmit the message through the entire route

Circuit-Switching Protocol

- First cut-through protocol
- The route is created before the first message is sent
- The message is sent directly to the destination through this route
- The nodes used in this transmission cannot be used for any other communication while it is in progress

Wormhole (WH) Protocol

- A second cut-through protocol
- The destination address is stored in the header of the message.
- Routing is performed dynamically at each node.
- The message is split into small packets called flits
- If two flits arrive at the same time, they are stored in the intermediate nodes' internal registers

Point-to-Point Communication Comparisons

- Store and forward is not used in physical networks, only at the application level
- Cut-through protocols are more efficient
  - Hide the distance between nodes
  - Avoid large buffer requirements at intermediate nodes
  - Almost no message loss
  - For small networks, a flow-control mechanism is not needed
- Wormhole is generally preferred to circuit switching
  - Latency is normally much lower

LogP Model

- Models based on the LogP model are more precise than the Hockney model
- Involves three components of communication: the sender, the network, and the receiver
  - At times, some of these components may be busy while others are not.
- Some parameters for LogP
  - m is the message size (in bytes)
  - w is the size of the packets the message is split into
  - L is an upper bound on the latency
  - o is the overhead
    - Defined to be the time that a node is engaged in the transmission or reception of a packet

LogP Model (cont)

- Parameters for LogP (cont)
  - g, or gap, is the minimal time interval between consecutive packet transmissions or packet receptions
    - During this time, a node may not use the communication coprocessor (i.e., network card)
    - 1/g is the communication bandwidth available per node
  - P is the number of nodes in the platform
- Cost of sending m bytes with packet size w
- Processor occupation time on sender/receiver

Other LogP Related Models

- LogP attempts to capture the characteristics of parallel platforms in a few parameters.
- Platforms are fine-tuned and may use different protocols for short and long messages
- LogGP is an extension of LogP where G captures the bandwidth for long messages
- pLogP is an extension of LogP where L, o, and g depend on the message size m.
  - It also separates the sender overhead $o_s$ and the receiver overhead $o_r$.

Affine Models

- The use of floor functions in LogP models causes them to be nonlinear.
  - This causes many problems in analytic and theoretical studies.
  - It has led to the proposal of many fully linear models.
- The time that $P_i$ is busy sending a message is expressed as an affine function of the message size
  - An affine function of m has the form $f(m) = a\,m + b$, where a and b are constants. If b = 0, then f is a linear function.
- Similarly, the time $P_j$ is busy receiving the message is expressed as an affine function of the message size
- We postpone further coverage of affine models for the present

Modeling Concurrent Communications

- Multi-port model
  - Assumes that communications are contention-free and do not interfere with each other.
  - A consequence is that a node may communicate with an unlimited number of nodes without any degradation in performance.
  - Would require a clique interconnection network to fully support.
  - May simplify proofs that certain problems are hard
    - If a problem is hard under ideal communication conditions, then it is hard in general.
  - The assumption is not realistic: communication resources are always limited.
  - See the Casanova text for additional information.

Concurrent Communications Models (2/5)

- Bounded multi-port model
  - Proposed by Hong and Prasanna
  - For applications that use threads (e.g., on multi-core technology), the network link can be shared by several incoming and outgoing communications.
  - The sum of the bandwidths allocated by the operating system to all communications cannot exceed the bandwidth of the network card.
  - An unbounded number of communications can take place if they share the total available bandwidth.
  - The sharing determines the bandwidth allotted to each communication
  - Bandwidth sharing by the application is unusual, as it is usually handled by the operating system.

Concurrent Communications Models (3/5)

- 1-port (unidirectional or half-duplex) model
  - Avoids unrealistically optimistic assumptions
  - Forbids concurrent communication at a node.
  - A node can either send data or receive it, but not both simultaneously.
  - This model is very pessimistic, as real-world platforms can achieve some concurrent communications.
  - The model is simple, and it is easy to design algorithms that follow it.

Concurrent Communications Models (4/5)

- 1-port (bidirectional or full-duplex) model
  - Currently, most network cards are full-duplex.
  - Allows a single emission and a single reception simultaneously.
  - Introduced by Bhat et al.
- Current hardware does not easily enable multiple messages to be transmitted simultaneously.
  - Multiple sends and receives are claimed to be eventually serialized by the single hardware port to the network.
  - Saif and Parashar did experimental work that suggests asynchronous sends become serialized when message sizes exceed a few megabytes.

Concurrent Communications Models (5/5)

- k-ports model
  - A node may have k > 1 network cards
  - This model allows a node to be involved in at most one emission and one reception on each network card.
  - This model is used in Chapters 4 and 5.

Bandwidth Sharing

- The previous concurrent communication models only consider contention on nodes
- Other parts of the network can also limit performance
- It may be useful to determine constraints on each network link
- Network models of this type are useful for performance evaluation, but are too complicated for algorithm design.
- The Casanova text evaluates algorithms using two models
  - The Hockney model, or even simplified versions of it (e.g., assuming no latency)
  - The multi-port model (ignoring contention) or the 1-port model

Case Study: Unidirectional Ring

- We first consider a platform of p processors arranged in a unidirectional ring.
- Processors are denoted $P_k$ for k = 0, 1, ..., p-1.
- Each PE can find its logical index by calling My_Num().

Unidirectional Ring Basics

- A processor can determine the number of PEs by calling NumProcs()
- Both preceding commands are supported in MPI, a message-passing library standard implemented on most asynchronous systems.
- Each processor has its own memory
- All processors execute the same program, which acts on data in their local memories
  - Single Program, Multiple Data, or SPMD
- Processors communicate by message passing
  - explicitly sending and receiving messages.

Unidirectional Ring Basics (cont 2/5)

- A processor sends a message using the function
  - send(addr, m)
  - addr is the memory address (in the sender process) of the first data item to be sent.
  - m is the message length (i.e., the number of items to be sent)
- A processor receives a message using the function
  - receive(addr, m)
  - addr is the local address in the receiving processor where the first data item is to be stored.
- If processor $P_i$ executes a receive, then its predecessor $P_{(i-1) \bmod p}$ must execute a send.
- Since each processor has a unique predecessor and successor, they do not have to be specified

Unidirectional Ring Basics (cont 3/5)

- A restrictive assumption is that both the send and the receive are blocking.
  - The participating processes then cannot continue until the communication is complete.
  - The blocking assumption is typical of first-generation platforms
- A classical assumption is to keep the receive blocking but to allow the send to be non-blocking
  - The processor executing a send can continue while the data transfer takes place.
  - To implement this, one function is used to initiate the send and another function is used to determine when the communication is finished.

Unidirectional Ring Basics (cont 4/5)

- In algorithms, we simply indicate the blocking and non-blocking operations
- A more recently proposed assumption is that a single processor can send data, receive data, and compute simultaneously.
  - All three can occur only if no race condition exists.
  - It is convenient to think of three logical threads of control running on a processor
    - One for computing
    - One for sending data
    - One for receiving data
- We will usually use the less restrictive third assumption

Unidirectional Ring Basics (cont 5/5)

- Timings for Send/Receive
  - We use a simplified version of the Hockney model
  - The time to send or receive over one link is
    - $c(m) = L + m\,b$
  - m is the length of the message
  - L is the startup cost in seconds, due to the physical latency and the software overhead
  - b is the inverse of the data transfer rate.

The Broadcast Operation

- The broadcast operation allows a processor $P_k$ to send the same message of length m to all other processors.
- At the beginning of the broadcast operation, the message is stored at the address addr in the memory of the sending process, $P_k$.
- At the end of the broadcast, the message will be stored at address addr in the memory of all processors.
- All processors must call the following function
  - Broadcast(k, addr, m)

Broadcast Algorithm Overview

- The message goes around the ring from processor to processor
  - from $P_k$ to $P_{k+1}$ to $P_{k+2}$ to ... to $P_{k-1}$.
- We assume the processor numbers are taken modulo p, where p is the number of processors. For example, if k = 0 and p = 8, then k-1 = p-1 = 7.
- Note there is no parallelism in this algorithm, since the message advances around the ring only one processor per round.
- The predecessor of $P_k$ (i.e., $P_{k-1}$) does not send the message to $P_k$.

Analysis of Broadcast Algorithm

- For the algorithm to be correct, the receive in Step 10 must execute before Step 11.
- Running Time
  - Since we have a sequence of p-1 communications, the time to broadcast a message of length m is
    - $(p-1)(L + m\,b)$
- MPI does not typically use a ring topology for creating communication primitives
  - Instead, it uses various tree topologies that are more efficient on modern parallel computer platforms.
  - However, these primitives are simpler on a ring.
  - Studying them prepares readers to implement their own primitives when these are more efficient than the MPI primitives.
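
The running time of the ring broadcast can be illustrated with a small timing model. This is only a sketch under the simplified Hockney assumption that each hop costs L + mb; the function name is hypothetical, and it models arrival times rather than performing any real message passing.

```python
def ring_broadcast_arrivals(p, k, m, L, b):
    """Time at which each processor holds the message when P_k broadcasts
    on a unidirectional ring of p processors (simplified Hockney model)."""
    hop = L + m * b                  # cost of one send/receive over one link
    arrival = {k: 0.0}               # the root already holds the message
    for step in range(1, p):
        # The message advances exactly one processor per round.
        arrival[(k + step) % p] = step * hop
    return arrival
```

The last processor to receive the message is the predecessor of the root, after p-1 hops, which reproduces the (p-1)(L + mb) total.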

Scatter Algorithm

- The scatter operation allows $P_k$ to send a different message of length m to each processor.
- Initially, $P_k$ holds the message of length m to be sent to $P_q$ at location addr[q].
- To keep the array of addresses uniform, space for a message to $P_k$ is also provided.
- At the end of the algorithm, each processor stores its message from $P_k$ at location msg.
- The efficient way to implement this algorithm is to pipeline the messages.
  - The message to the most distant processor (i.e., $P_{k-1}$) is followed by the message to processor $P_{k-2}$.

Discussion of Scatter Algorithm

- In Steps 5-6, $P_k$ successively sends messages to the other p-1 processors in decreasing order of their distance from $P_k$.
- In Step 7, $P_k$ stores its message to itself.
- The other processors concurrently move messages along as they arrive in Steps 9-12.
- Each processor uses two buffers with addresses tempS and tempR.
  - This allows a processor to send a message and to receive the next message in parallel in Step 12.

Discussion of Scatter Algorithm (cont)

- In Step 11, tempS ↔ tempR means the two addresses are swapped so that the received value can be sent to the next processor.
- When a processor receives its message from $P_k$, the processor stops forwarding (Step 10).
- Whatever is in the receive buffer, tempR, at the end is stored as its message from $P_k$ (Step 13).
- The running time of the scatter algorithm is the same as for the broadcast, namely
  - $(p-1)(L + m\,b)$

Example for Scatter Algorithm

- Example: In Figure 3.7, let p = 6 and k = 4.
- Steps 5-6: For i = 1 to p-1 do
  - send(addr[(k+p-i) mod p], m)
- Let PE = (k+p-i) mod p = (10 - i) mod 6
  - For i = 1, PE = 9 mod 6 = 3
  - For i = 2, PE = 8 mod 6 = 2
  - For i = 3, PE = 7 mod 6 = 1
  - For i = 4, PE = 6 mod 6 = 0
  - For i = 5, PE = 5 mod 6 = 5
- Note that messages are sent to processors in the order 3, 2, 1, 0, 5
  - That is, messages to the most distant processors are sent first.
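
The send order in this example can be reproduced directly from the index expression (k+p-i) mod p. The sketch below uses hypothetical helper names and also computes each destination's distance from the root along the unidirectional ring.

```python
def scatter_send_order(p, k):
    """Destinations of the messages P_k sends for i = 1, ..., p-1."""
    return [(k + p - i) % p for i in range(1, p)]

def ring_distance(p, k, q):
    """Number of links from P_k to P_q when messages travel k -> k+1 -> ..."""
    return (q - k) % p

order = scatter_send_order(6, 4)                      # [3, 2, 1, 0, 5]
distances = [ring_distance(6, 4, q) for q in order]   # [5, 4, 3, 2, 1]
```

The distances decrease from 5 to 1, confirming that the message for the most distant processor leaves first.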

Example for Scatter Algorithm (cont)

- Example: In Figure 3.7, let p = 6 and k = 4.
- Step 10: For i = 1 to (k-1-q) mod p do
  - Compute (k-1-q) mod p = (3-q) mod 6 for all q.
  - Note q ≠ k, which is 4
- q = 5 ⇒ i = 1 to 4, since (3-5) mod 6 = 4
  - PE 5 forwards values in the loop from i = 1 to 4
- q = 0 ⇒ i = 1 to 3, since (3-0) mod 6 = 3
  - PE 0 forwards values from i = 1 to 3
- q = 1 ⇒ i = 1 to 2, since (3-1) mod 6 = 2
  - PE 1 forwards values from i = 1 to 2
- q = 2 ⇒ i = 1 to 1, since (3-2) mod 6 = 1
  - PE 2 is active in the loop only when i = 1

Example for Scatter Algorithm (cont)

- q = 3 ⇒ i = 1 to 0, since (3-3) mod 6 = 0
  - PE 3 precedes PE k, so it never forwards a value
  - However, it receives and stores a message at Step 9
- Note that in Step 9, all processors store the first message they receive.
  - That means even processor k-1 receives a value to store.
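
The forwarding counts above can be cross-checked with a small simulation of the pipeline. This is a simplified sketch, not the textbook's algorithm listing: messages are represented only by their destination index, the function name is hypothetical, and each processor is assumed to forward every message it receives except the last one, which it keeps.

```python
def simulate_scatter(p, k):
    """Return (stored, forwards): the message each processor keeps and the
    number of messages each non-root processor forwards."""
    # Messages leave P_k in the order given by (k + p - i) mod p.
    stream = [(k + p - i) % p for i in range(1, p)]
    stored = {k: k}                      # P_k keeps its own message
    forwards = {}
    q = (k + 1) % p
    while q != k:
        forwards[q] = len(stream) - 1    # everything but the last is passed on
        stored[q] = stream[-1]           # the last message received is kept
        stream = stream[:-1]             # what the successor will receive
        q = (q + 1) % p
    return stored, forwards
```

For p = 6 and k = 4, the forwarding counts come out as 4, 3, 2, 1, 0 for processors 5, 0, 1, 2, 3, which agrees with the loop bound (k-1-q) mod p, and every processor ends up holding the message addressed to it.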

All-to-All Algorithm

- This operation allows all p processors to simultaneously broadcast a message (to all PEs)
- Again, it is assumed all messages have length m
- At the beginning, each processor holds the message it wishes to broadcast at address my_message.
- At the end, each processor will hold an array addr of p messages, where addr[k] holds the message from $P_k$.
- Using pipelining, the running time is the same as for a single broadcast, namely
  - $(p-1)(L + m\,b)$

Gossip Algorithm

- Last of the classical collective communication operations
- Each processor sends a different message to each other processor.
- The gossip algorithm is Problem 3.7 in the textbook.
- Note it will take 1 step for each PE to send a message to its closest neighbor, using all links.
- It will take 2 steps for each PE to send a message to its 2nd closest neighbor, using all links.
- In general, it will take 1 + 2 + ... + (p-1) = p(p-1)/2 steps for each PE to send messages to all other nodes, using all links of the network during each step.
- Complexity is p(p-1)/2 = $O(p^2)$ steps.

Pipelined Broadcast by kth processor

- Longer messages can be broadcast faster if they are broken into smaller pieces
- Suppose the message is broken into r pieces of the same length.
- The sender sends the pieces out in order, and they travel simultaneously around the ring.
- Initially, the pieces are stored at addresses addr[0], addr[1], ..., addr[r-1].
- At the end, all pieces are stored in all processors.
- At each step, when a processor receives a message piece, it also forwards the piece it previously received, if any, to its successor.

Pipelined Broadcast (cont)

- It takes p-1 communication steps for the first piece to reach the last processor, $P_{k-1}$.
- It then takes r-1 further steps for the rest of the pieces to reach $P_{k-1}$.
- The required time is $(p + r - 2)(L + mb/r)$
- The value of r that minimizes this expression can be found by setting its derivative (with respect to r) to zero and solving for r.
- For large m, the time required tends to mb
  - This does not depend on p.
- Compares well to the non-pipelined broadcast time $(p-1)(L + m\,b)$
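
The minimization can be worked through as follows. Expanding the time gives (p-2)L + (p-2)mb/r + rL + mb, whose derivative vanishes at r = sqrt(m(p-2)b/L); substituting back gives a minimum of (sqrt((p-2)L) + sqrt(mb))^2. The sketch below, with hypothetical function names and a real-valued r for simplicity, evaluates these expressions.

```python
import math

def pipelined_broadcast_time(p, m, L, b, r):
    """Time for a pipelined ring broadcast with the message cut into r pieces."""
    return (p + r - 2) * (L + m * b / r)

def optimal_pieces(p, m, L, b):
    """Real-valued r at which the derivative with respect to r is zero."""
    return math.sqrt(m * (p - 2) * b / L)

def minimal_time(p, m, L, b):
    """Closed form of the time at the optimal r."""
    return (math.sqrt((p - 2) * L) + math.sqrt(m * b)) ** 2
```

Dividing `minimal_time` by m*b gives a ratio that approaches 1 as m grows, which is the sense in which the time tends to mb independently of p.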

Hypercube

- Defn: A 0-cube consists of a single vertex. For n > 0, an n-cube consists of two identical (n-1)-cubes with edges added to join matching pairs of vertices in the two (n-1)-cubes.

Hypercubes (cont)

- Equivalent defn: An n-cube is a graph consisting of $2^n$ vertices labeled 0 to $2^n - 1$ such that two vertices are connected if and only if their binary representations differ in exactly one bit.
- Property: The diameter and degree of an n-cube are equal to n.
  - Proof is left to the reader. Easy using recursion.
- Hamming distance: Let A and B be two points in an n-cube. H(A,B) is the number of bits that differ between the binary labels of A and B.
- Notation: If b is a binary bit, then let $\bar{b} = 1 - b$, the bit complement of b.
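
The equivalence of the two definitions can be checked on small cubes. The following sketch (function names are hypothetical) builds the edge set recursively from two copies of the (n-1)-cube and compares it with the edge set given by the one-bit-difference rule; vertices are integers and each edge is stored as a pair with the smaller label first.

```python
def cube_edges_recursive(n):
    """Edges of the n-cube built from two copies of the (n-1)-cube."""
    if n == 0:
        return set()                       # a 0-cube is a single vertex
    smaller = cube_edges_recursive(n - 1)
    half = 1 << (n - 1)
    copy0 = smaller                                        # leading bit 0
    copy1 = {(u + half, v + half) for (u, v) in smaller}   # leading bit 1
    matching = {(v, v + half) for v in range(half)}        # join matching vertices
    return copy0 | copy1 | matching

def cube_edges_bitflip(n):
    """Edges of the n-cube from the rule: labels differ in exactly one bit."""
    edges = set()
    for v in range(2 ** n):
        for bit in range(n):
            w = v ^ (1 << bit)
            if v < w:
                edges.add((v, w))
    return edges

def hamming(a, b):
    """H(A,B): number of bit positions in which the labels differ."""
    return bin(a ^ b).count("1")
```

For instance, `hamming(0b10111, 0b01110)` is 3, since the labels differ in three bit positions.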

Hypercube Paths

- Using binary representation, let
  - $A = a_{n-1} a_{n-2} \ldots a_2 a_1 a_0$
  - $B = b_{n-1} b_{n-2} \ldots b_2 b_1 b_0$
- WLOG, assume that A and B differ exactly in their last k bits.
  - Having the differing bits at the end makes numbering them easier
- Then a path from A to B can be created by the following sequence of nodes
  - $A = a_{n-1} a_{n-2} \ldots a_2 a_1 a_0$
  - Vertex 1 $= a_{n-1} a_{n-2} \ldots a_2 a_1 \bar{a}_0$
  - Vertex 2 $= a_{n-1} a_{n-2} \ldots a_2 \bar{a}_1 \bar{a}_0$
  - ...
  - $B$ = Vertex k $= a_{n-1} a_{n-2} \ldots a_k \bar{a}_{k-1} \ldots \bar{a}_2 \bar{a}_1 \bar{a}_0$

Hypercube Paths (cont)

- Regardless of which bits of A and B agree, there are k choices for the first bit to flip, (k-1) choices for the next bit to flip, etc.
  - This gives k! different paths from A to B.
- How many independent paths exist from A to B?
  - I.e., paths with only A and B as common vertices.
- Theorem: If A and B are n-cube vertices that differ in k bits, then there exist exactly k independent paths from A to B.
- Proof: First, we show that k independent paths exist.
  - We build an independent path for each j with 0 ≤ j < k

Hypercube Paths (cont)

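- Let P(j, j-1, j-2, ..., 0, k-1, k-2, ..., j+1) denote the path from A to B with the following sequence of nodes
  - $A = a_{n-1} \ldots a_k\, a_{k-1} a_{k-2} \ldots a_{j+1}\, a_j a_{j-1} a_{j-2} \ldots a_1 a_0$
  - $V(1) = a_{n-1} \ldots a_k\, a_{k-1} a_{k-2} \ldots a_{j+1}\, \bar{a}_j a_{j-1} a_{j-2} \ldots a_1 a_0$
  - $V(2) = a_{n-1} \ldots a_k\, a_{k-1} a_{k-2} \ldots a_{j+1}\, \bar{a}_j \bar{a}_{j-1} a_{j-2} \ldots a_1 a_0$
  - ...
  - $V(j+1) = a_{n-1} \ldots a_k\, a_{k-1} a_{k-2} \ldots a_{j+1}\, \bar{a}_j \bar{a}_{j-1} \ldots \bar{a}_1 \bar{a}_0$
  - $V(j+2) = a_{n-1} \ldots a_k\, \bar{a}_{k-1} a_{k-2} \ldots a_{j+1}\, \bar{a}_j \bar{a}_{j-1} \ldots \bar{a}_1 \bar{a}_0$
  - ...
  - $V(k-1) = a_{n-1} \ldots a_k\, \bar{a}_{k-1} \ldots \bar{a}_{j+2}\, a_{j+1}\, \bar{a}_j \bar{a}_{j-1} \ldots \bar{a}_1 \bar{a}_0$
  - $B = V(k) = a_{n-1} \ldots a_k\, \bar{a}_{k-1} \ldots \bar{a}_{j+1}\, \bar{a}_j \bar{a}_{j-1} \ldots \bar{a}_1 \bar{a}_0$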

Hypercube Pathways (cont)

- Suppose the following two paths have a common vertex X other than A and B.
  - P(j, j-1, j-2, ..., 0, k-1, k-2, ..., j+1)
  - P(t, t-1, t-2, ..., 0, k-1, k-2, ..., t+1)
- Since the paths are different, and A and B differ in k bits, we may assume 0 ≤ t < j < k
- Let A and X differ in q bits (so 0 < q < k, since X is neither A nor B)
- To travel from A to X along either path, exactly q bits of the circular sequence have been flipped, in left-to-right order for each
  - 1st path: j, j-1, ..., t, t-1, ..., 0, k-1, k-2, ..., j+1
  - 2nd path: t, t-1, ..., 0, k-1, k-2, ..., j+1, j, ..., t+1
- This is impossible, as the q bits flipped along each path cannot be exactly the same bits.

Hypercube Paths (cont)

- Finally, there cannot be another independent path Q from A to B (i.e., one sharing no vertex other than A and B with the k paths above).
- If there were, the first node on Q after A would have to flip one bit, say bit q, to agree with B.
- But then the path described earlier that flips bit q first would have a common interior vertex with Q.
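
The construction of the k independent paths can be sketched in code. The version below is an illustration with a hypothetical function name; it drops the WLOG assumption by listing the differing bit positions explicitly and rotating through that list in the circular order j, j-1, ..., 0, k-1, ..., j+1.

```python
def independent_paths(a, b, n):
    """One path per differing bit: path j flips the differing bits of a and b
    in circular order, starting with the j-th differing bit."""
    differing = [bit for bit in range(n) if ((a ^ b) >> bit) & 1]
    k = len(differing)
    paths = []
    for j in range(k):
        # Positions j, j-1, ..., 0, k-1, ..., j+1 within the list of differing bits.
        order = [differing[(j - t) % k] for t in range(k)]
        node, path = a, [a]
        for bit in order:
            node ^= 1 << bit          # flip one bit, moving to a neighbor
            path.append(node)
        paths.append(path)
    return paths
```

For A = 10111 and B = 01110 in the 5-cube this yields three paths of length 3 whose interior vertices are pairwise disjoint.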

Hypercube Routing

- XOR is the exclusive OR: the result is 1 when exactly one input is 1.
- To design a route from A to B in the n-cube, we use the algorithm that always flips the rightmost bit that disagrees with B.
- The XOR of the binary representations of A and B indicates by 1s the bits that have to be flipped.
- For the 5-cube, if A is 10111 and B is 01110, then
  - A XOR B is 11001, and the routing is as follows
  - A = 10111 → 10110 → 11110 → 01110 = B
- This algorithm can be executed as follows
  - A XOR B = 10111 XOR 01110 = 11001, so A routes the message along link 1 (the units-digit link) to Node A1 = 10110
  - A1 XOR B = 10110 XOR 01110 = 11000, so Node A1 routes the message along link 4 to Node A2 = 11110
  - A2 XOR B = 11110 XOR 01110 = 10000, so A2 routes the message along link 5 to B
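
The routing rule can be written compactly with bit operations. This is a minimal sketch with a hypothetical function name: node labels are integers, and `diff & -diff` isolates the rightmost bit in which the current node still disagrees with the destination.

```python
def route(a, b):
    """Nodes visited from a to b, always flipping the rightmost differing bit."""
    path = [a]
    current = a
    while current != b:
        diff = current ^ b            # 1s mark the bits still to be flipped
        lowest = diff & -diff         # isolate the rightmost 1 of diff
        current ^= lowest             # traverse the corresponding link
        path.append(current)
    return path

hops = [format(v, "05b") for v in route(0b10111, 0b01110)]
# hops == ['10111', '10110', '11110', '01110']
```

The number of hops equals the Hamming distance between the two labels, so the route is a shortest path.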

Hypercube Routing (cont)

- This routing algorithm can be used to implement a wormhole or cut-through protocol in hardware.
- Problem: If another pair of processors has already reserved one link on the desired path, the message may stall until the end of the other communication.
- Solution: Since there are multiple paths, the routers select links to use based on a link reservation table and message labels.
  - In our example, if link 1 in Node A is busy, then instead use link 4 to forward the message to the new node 11111, which is on another path to B.
  - If at some point the current vertex determines that no useful link is available, then it will have to wait for a useful link to become available
  - Alternatively, if the desired links are not available at some vertex, the algorithm could use a link that extends the path length

Gray Code

- Recursive construction of the Gray code
  - $G_1 = (0, 1)$ and has $2^1 = 2$ elements
  - $G_2 = (00, 01, 11, 10)$ and has $2^2 = 4$ elements
  - $G_3 = (000, 001, 011, 010, 110, 111, 101, 100)$ and has $2^3 = 8$ elements
  - etc.
- The Gray code for dimension n ≥ 1 is denoted $G_n$ and is defined recursively. $G_1 = (0, 1)$ and, for n > 1, $G_n$ is the sequence $0G_{n-1}$ followed by the sequence $1G_{n-1}^{rev}$, where
  - $xG$ is the sequence obtained by prefixing every element of G with x
  - $G^{rev}$ is the sequence obtained by listing the elements of G in reverse order
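
The recursive definition translates almost line for line into code. The sketch below (the function name is an assumption) represents each code word as a string of 0s and 1s.

```python
def gray_code(n):
    """Gray code of dimension n >= 1 as a list of n-character bit strings."""
    if n == 1:
        return ["0", "1"]
    prev = gray_code(n - 1)
    # 0 G_{n-1} followed by 1 G_{n-1} listed in reverse order.
    return ["0" + code for code in prev] + ["1" + code for code in reversed(prev)]

g3 = gray_code(3)
# g3 == ['000', '001', '011', '010', '110', '111', '101', '100']
```

Each call doubles the length of the sequence, so `gray_code(n)` has 2^n elements.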

Gray Code (cont.)

- Since we assume $G_{n-1}$ has $2^{n-1}$ elements and $G_n$ consists of exactly two copies of $G_{n-1}$, it has $2^n$ elements.
- Summary: The Gray code $G_n$ is an ordered sequence of all the $2^n$ binary codes with n digits, in which successive values differ from each other in exactly one bit.
- Notation: Let $g_i(r)$ denote the ith element of the Gray code of dimension r.
- Observation: The Gray code $G_n = (g_1(n), g_2(n), \ldots, g_{2^n}(n))$ forms an ordered sequence of names for all of the nodes in a ring with $2^n$ nodes.

Embeddings

- Defn: An embedding of a topology (e.g., ring, 2D mesh, etc.) into an n-cube is a 1-1 function f from the vertices of the topology into the n-cube.
- An embedding is said to preserve locality if the images of any two neighbors are also neighbors in the n-cube
- If an embedding does not preserve locality, then we try to minimize the distance between the images of neighbors in the hypercube.
- An embedding is said to be onto if the range of the embedding function f is the entire n-cube.

A 2^n Ring Embedding onto n-cube

- Theorem: There is an embedding of a ring with $2^n$ vertices onto an n-cube that preserves locality.
- Proof
  - Our construction of the Gray code provides an ordered sequence of binary values with n digits that can be used as names of the nodes of the ring with $2^n$ vertices
  - The first name (e.g., 000) can be used to name any node in the ring.
  - The Gray code names are then given to the ring nodes successively, in clockwise or counter-clockwise order

Embedding Ring onto n-cube (cont)

- The Gray code binary numbers are identical to the names assigned to hypercube nodes.
- Two successive n-cube nodes are connected, since they differ in only one binary digit
- So the embedding results from the Gray code providing an ordered sequence of n-cube names that is used to give successive names to the $2^n$ ring nodes.
- This concludes the proof that this embedding follows from the construction of the Gray code.
- However, the following formal proof provides more details and an "after construction" argument.

A 2^n Ring Embedding onto the n-cube

- The following optional formal proof is included for those who do not find the preceding argument convincing
- Theorem: There is an embedding of a ring with $2^n$ vertices onto an n-cube that preserves locality.
- Proof
  - We establish the following claim: The mapping $f(i) = g_i(n)$ is an embedding of the ring onto the n-cube.
  - This claim is true for n = 1, as $G_1 = (0, 1)$ and nodes 0 and 1 are connected on both the ring and the 1-cube.
  - We assume the above claim is true for a fixed n-1 with n > 1.
  - We use the neighbor-preserving embedding of the vertices of a ring with $2^{n-1}$ nodes onto the (n-1)-cube to build a similar embedding f of a ring with $2^n$ vertices onto the n-cube.

Ring Embedding for n-cube (cont)

- Recall that the Gray code $G_n$ is the sequence $0G_{n-1}$ followed by the sequence $1G_{n-1}^{rev}$
- The n-cube consists of one copy of an (n-1)-cube with a 0 prefix and a second copy of an (n-1)-cube with a 1 prefix.
- By the assumption that the claim is true for n-1, the Gray code sequence $0G_{n-1}$ provides the binary codes for a ring of elements in the first copy of the (n-1)-cube, with each successive element differing by one digit from its previous element.
- Likewise, the Gray code sequence $1G_{n-1}^{rev}$ provides the binary codes for a ring of elements in the second copy of the (n-1)-cube, with each successive element differing by one digit from its previous element.
- The last element of $0G_{n-1}$ is identical to the first element of $1G_{n-1}^{rev}$ except for the added prefix digit, so these two elements also differ by one bit.
- Similarly, the first element of $0G_{n-1}$ and the last element of $1G_{n-1}^{rev}$ differ only in the prefix digit, which closes the ring.
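
The claim can be checked exhaustively for small n. The sketch below is an illustration, not part of the proof: it uses the standard closed form i XOR (i >> 1) for the binary-reflected Gray code, which produces the same sequence as the recursive construction, and it indexes elements from 0 rather than from 1. The function names are hypothetical.

```python
def gray(i):
    """i-th element (0-indexed) of the binary-reflected Gray code, as an integer."""
    return i ^ (i >> 1)

def ring_embedding_ok(n):
    """True if i -> gray(i) maps the 2^n-node ring onto the n-cube with
    ring neighbors landing on cube neighbors (including the wrap-around)."""
    size = 2 ** n
    images = [gray(i) for i in range(size)]
    onto = sorted(images) == list(range(size))
    local = all(
        bin(images[i] ^ images[(i + 1) % size]).count("1") == 1
        for i in range(size)
    )
    return onto and local
```

The wrap-around case i = 2^n - 1 corresponds to the step in the proof where the first element of the 0-prefixed half meets the last element of the 1-prefixed half.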

2D Torus Embedding for n-cube

- We embed a $2^r \times 2^s$ torus onto an n-cube with n = r + s by using the Cartesian product $G_r \times G_s$ of two Gray codes.
- A processor with coordinates (i,j) on the grid is mapped to the processor $f(i,j) = (g_i(r), g_j(s))$ in the n-cube.
- Recall that the map $f_1(i) = g_i(r)$ is an embedding of a $2^r$ ring onto an r-cube (a row ring), and $f_2(j) = g_j(s)$ is an embedding of a $2^s$ ring onto an s-cube (a column ring).
- We identify $(g_i(r), g_j(s))$ with the node of the n-cube, n = r + s, whose first r bits are given by $g_i(r)$ and whose next s bits are given by $g_j(s)$.
- Then, for a fixed j, f(i±1, j) are neighbors of f(i,j), since $f_1$ is an embedding of a $2^r$ ring onto an r-cube.
- Likewise, for a fixed i, f(i, j±1) are neighbors of f(i,j), since $f_2$ is an embedding of a $2^s$ ring onto an s-cube.
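
The torus embedding can be verified in the same spirit. The sketch below assumes r ≥ 1 and s ≥ 1, uses 0-indexed coordinates, and relies on the standard closed form i XOR (i >> 1) for the reflected Gray code; the function names are hypothetical.

```python
def gray(i):
    """i-th element (0-indexed) of the binary-reflected Gray code."""
    return i ^ (i >> 1)

def torus_to_cube(i, j, s):
    """Concatenate the Gray code of the row index with that of the column index."""
    return (gray(i) << s) | gray(j)

def check_torus_embedding(r, s):
    """True if the map is 1-1 and every torus neighbor is a cube neighbor."""
    rows, cols = 2 ** r, 2 ** s
    images = set()
    for i in range(rows):
        for j in range(cols):
            here = torus_to_cube(i, j, s)
            images.add(here)
            neighbors = [((i + 1) % rows, j), ((i - 1) % rows, j),
                         (i, (j + 1) % cols), (i, (j - 1) % cols)]
            for ni, nj in neighbors:
                if bin(here ^ torus_to_cube(ni, nj, s)).count("1") != 1:
                    return False
    return len(images) == rows * cols
```

Moving along a row changes only the first r bits and moving along a column changes only the last s bits, which is why the two ring embeddings combine without interfering.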

Collective Communications in Hypercube

- Purpose: Gain an overview of the complexity of collective communications on the hypercube.
  - We focus on broadcast on the hypercube.
- Assume processor 0 wants to broadcast, and consider the naïve algorithm
  - Processor 0 sends the message to all of its neighbors
  - Next, every neighbor sends the message to all of its neighbors.
  - Etc.
- Redundancy in the naïve algorithm
  - The same processor receives the same message many times.
    - E.g., processor 0 receives the message back from all of its neighbors.
  - Mismatched SENDs and RECEIVEs may happen.

Improved Hypercube Broadcast

- We seek a strategy where
  - Each processor receives the message only once
  - The number of steps is minimal.
- Will use one or more spanning trees.
- Send and Receive need a parameter specifying the dimension on which the communication takes place
  - SEND(cube_link, send_addr, m)
  - RECEIVE(cube_link, send_addr, m)

Hypercube Broadcast Algorithm

- There are n steps, numbered from n-1 down to 0
- Every processor receives the msg on the link corresponding to its rightmost 1
- A processor receiving the msg forwards it on the links whose index is smaller than that of its rightmost 1.
- At step i, all processors whose rightmost 1 is at a position strictly larger than i forward the msg on link i.
- Let the broadcast originate with processor 0
  - Assume 0 has a fictitious 1 at position n.
  - This adds an additional digit, as processor labels have binary digits for positions 0, 1, ..., n-1.

Trace of Hypercube Broadcast for n = 4

- Since the broadcast originates with processor 0, we assume its index is 10000
- Processor 0 sends the broadcast msg on link 3 to 1000
- Next, both processors 0000 and 1000 have their rightmost 1 at position at least three, so both send the message along link 2
  - The msg goes to 0100 and 1100, respectively.
- This process continues until the last step, at which every even-numbered processor sends the msg along its link 0.
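
The trace can be reproduced with a short simulation of the rule "at step i, every processor whose rightmost 1 is strictly above position i forwards on link i". This is a sketch for a broadcast rooted at processor 0 only; the function names are hypothetical and no real communication takes place.

```python
def rightmost_one(v, n):
    """Position of the rightmost 1 bit; processor 0 gets a fictitious 1 at position n."""
    return n if v == 0 else (v & -v).bit_length() - 1

def hypercube_broadcast(n):
    """List of (step, sender, receiver) triples for a broadcast from processor 0."""
    has_msg = {0}
    log = []
    for i in range(n - 1, -1, -1):            # steps n-1 down to 0
        senders = [v for v in sorted(has_msg) if rightmost_one(v, n) > i]
        for v in senders:
            w = v ^ (1 << i)                  # neighbor across link i
            log.append((i, v, w))
            has_msg.add(w)
    return log

trace = hypercube_broadcast(4)
# The first three entries are (3, 0, 8), (2, 0, 4) and (2, 8, 12).
```

Every processor other than 0 appears exactly once as a receiver, and it receives on the link given by its rightmost 1, so the sends form a spanning tree of the cube.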

Broadcasting using Spanning Tree

Broadcast Algorithm

- Let BIT(A,b) denote the value of the bth bit of processor A.
- The algorithm for the broadcast of a message of length m by processor k is given in Algorithm 3.5
- Since there are n steps, the execution time with the store-and-forward model is $n(L + m\,b)$.
- This algorithm is valid for the 1-port model.
  - At each step, a processor communicates with at most one other processor.

Broadcast Algorithm in Hypercube (cont)

- Observations about the algorithm steps
  - The algorithm specifies the actions of processor q
  - n is the number of binary digits used to label processors.
  - pos = q XOR k
    - Uses exclusive OR.
    - The broadcast is from $P_k$.
    - Relabels the processors so that the algorithm works as if $P_0$ were the root of the broadcast
  - Steps 5-7 set first-1 to the position of the first 1 in pos.
  - Note that Steps 8-10 are the core of the algorithm
    - phase steps through the link dimensions, higher ones first
    - If the link dimension equals first-1, q receives the message on link first-1
    - q then sends the message along each dimension smaller than first-1 to the processor below it in the spanning tree.
