Transcript and Presenter's Notes

Title: Advanced Computer Architecture CSE 8383


1
Advanced Computer ArchitectureCSE 8383
February 7, 2008, Session 4
2
Contents
  • Group Work
  • Dependence Analysis
  • Instruction Pipelines and hazards (revisit)
  • ILP
  • Multithreading
  • Multiprocessors

3
Group Activity
Reservation table (stages S1-S3, time steps 1-4):
S1: X X
S2: X
S3: X
Determine:
  • Collision Vector (C.V.)
  • State Diagram
  • Simple Cycles
  • Throughput (t = 20 ns)
  • MAL (Minimum Average Latency)
  • Greedy Cycles
4
Dependence Analysis
5
Types of Dependencies
  • Name dependencies
    • Output dependence
    • Anti-dependence
  • Data (true) dependence
  • Control dependence
  • Resource dependence

6
Name dependences
  • Output dependence
    • When instructions i and j write the same register or memory location, the ordering must be preserved to leave the correct value in the register
    • i: add r7, r4, r3
    • j: div r7, r2, r8
  • Anti-dependence
    • When instruction j writes a register or memory location that instruction i reads
    • i: add r6, r5, r4
    • j: sub r5, r8, r11

7
Data Dependences
  • An instruction j is data dependent on instruction i if either of the following holds:
  • instruction i produces a result that may be used by instruction j, or
  • instruction j is data dependent on instruction k, and instruction k is data dependent on instruction i

i: add r6, r5, r4    j: sub r1, r6, r11
[Dependence graph: an edge labeled r6 from node i to node j]
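To make the register dependences above concrete, here is a minimal sketch (my own illustration, not from the slides) that classifies RAW, WAR, and WAW between a pair of three-operand instructions; the parse helper and the "op rd, rs1, rs2" format are assumed for illustration.

```python
# Minimal sketch: classify RAW/WAR/WAW dependences between two
# three-operand instructions of the form "op rd, rs1, rs2".

def parse(instr):
    op, rest = instr.split(None, 1)
    regs = [r.strip() for r in rest.split(",")]
    return {"op": op, "dst": regs[0], "src": regs[1:]}

def dependences(i, j):
    """Return the dependences of later instruction j on earlier instruction i."""
    a, b = parse(i), parse(j)
    deps = []
    if a["dst"] in b["src"]:
        deps.append("RAW (true/data)")   # i writes a value that j reads
    if b["dst"] in a["src"]:
        deps.append("WAR (anti)")        # j overwrites a register i still reads
    if b["dst"] == a["dst"]:
        deps.append("WAW (output)")      # both write the same register
    return deps

print(dependences("add r6, r5, r4", "sub r1, r6, r11"))  # ['RAW (true/data)']
print(dependences("add r6, r5, r4", "sub r5, r8, r11"))  # ['WAR (anti)']
print(dependences("add r7, r4, r3", "div r7, r2, r8"))   # ['WAW (output)']
```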
8
Control Dependences
  • A control dependence determines the ordering of an instruction i with respect to a branch instruction, so that instruction i is executed in correct program order.
  • If p1
  • S1
  • If p2
  • S2

9
Resource dependences
  • An instruction is resource-dependent on a
    previously issued instruction if it requires a
    hardware resource which is still being used by a
    previously issued instruction.
  • div r1, r2, r3
  • div r4, r2, r5

10
Removing name dependences (Register renaming)
  • Read-Write dependency (anti)
  • DIV.D F0, F1, F2 (I1)
  • ADD.D F3, F0, F4 (I2)
  • SUB.D F4, F5, F6 (I3)
  • MUL.D F3, F5, F4 (I4)
  • I3 cannot complete before I2 starts, since I2 needs a value in F4 and I3 changes F4
  • Remember? An anti-dependence exists if an instruction uses a location as an operand while a following one is writing into that location;
  • if the first one is still using the location when the second writes into it, an error occurs

11
Register Renaming
  • Output dependencies and anti-dependencies can be treated like true data dependencies, as normal conflicts, by delaying the execution of an instruction until it can safely execute
  • Parallelism can be improved by eliminating output dependencies and anti-dependencies, which are not real data dependencies
  • These artificial dependencies can be eliminated by automatically allocating new registers to values when such dependencies have been detected
  • This technique is called register renaming

12
Register Renaming
Original                Renamed
DIV.D F0, F1, F2        DIV.D F0, F1, F2
ADD.D F3, F0, F4        ADD.D F3, F0, F4
SUB.D F4, F5, F6        SUB.D T, F5, F6
MUL.D F3, F5, F4        MUL.D S, F5, T
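A minimal sketch of how renaming can be automated (my own illustration, not the slides' or any particular processor's algorithm): every write to an architectural register is given a fresh rename register, and later reads use the most recent mapping, which removes the WAR and WAW hazards shown above.

```python
# Minimal register-renaming sketch: every new write to an architectural
# register is given a fresh name; reads use the latest mapping.
import itertools

def rename(instrs):
    fresh = (f"T{i}" for i in itertools.count(1))   # pool of rename registers
    latest = {}                                      # architectural -> current name
    out = []
    for op, dst, s1, s2 in instrs:
        s1 = latest.get(s1, s1)                      # read the most recent version
        s2 = latest.get(s2, s2)
        latest[dst] = next(fresh)                    # new version for every write
        out.append((op, latest[dst], s1, s2))
    return out

code = [("DIV.D", "F0", "F1", "F2"),
        ("ADD.D", "F3", "F0", "F4"),
        ("SUB.D", "F4", "F5", "F6"),
        ("MUL.D", "F3", "F5", "F4")]
for op, d, a, b in rename(code):
    print(f"{op} {d}, {a}, {b}")
# SUB.D and MUL.D now write fresh registers, so the WAR on F4 and WAW on F3 disappear.
```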

13
Instruction Pipelines and Hazards
14
Linear Instruction Pipelines
  • Assume the following instruction execution
    phases
  • Fetch (F)
  • Decode (D)
  • Operand Fetch (O)
  • Execute (E)
  • Write results (W)

15
Pipeline Instruction Execution
[Space-time diagram: stages F, D, O, E, W versus time; instructions I1, I2, I3 proceed through the stages in overlapped fashion]
16
Pipeline Execution
[Pipeline diagram: cycles on the horizontal axis; instructions I1-I5 each pass through F D O E W, with a new instruction entering the pipeline every cycle]
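As a small illustration of why the overlap pays off (a sketch with assumed instruction counts, not figures from the slides): a k-stage pipeline finishes n instructions in k + (n - 1) cycles instead of n x k.

```python
# Pipeline timing sketch: k-stage pipeline, n instructions, one stage per cycle.
def pipelined_cycles(n, k):
    return k + (n - 1)        # fill the pipe once, then one completion per cycle

def unpipelined_cycles(n, k):
    return n * k

n, k = 100, 5                  # assumed values for illustration
t = 20e-9                      # 20 ns per stage, as in the group activity
print("pipelined:  ", pipelined_cycles(n, k), "cycles")    # 104
print("unpipelined:", unpipelined_cycles(n, k), "cycles")  # 500
print("speedup: %.2f" % (unpipelined_cycles(n, k) / pipelined_cycles(n, k)))
print("throughput: %.1f MIPS" % (n / (pipelined_cycles(n, k) * t) / 1e6))
```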
17
Superscalar Execution (sneak preview)
[Superscalar diagram: cycles on the horizontal axis; instructions I1-I6 pass through F D O E W with multiple instructions issued per cycle, completing sooner than on the scalar pipeline]
18
Pipeline Hazards
  • CPI = Ideal pipeline CPI + Structural stalls + Data hazard stalls + Control stalls
  • Ideal pipeline CPI: the maximum performance attainable by the implementation
  • Structural hazards: the HW cannot support this combination of instructions
  • Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
  • Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)
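A quick numeric illustration of the CPI equation; the stall contributions below are assumed values, not measurements from the slides.

```python
# CPI = ideal CPI + structural stalls + data stalls + control stalls (per instruction)
ideal_cpi       = 1.0    # one instruction per cycle when nothing stalls
structural      = 0.05   # assumed average stall cycles per instruction
data_hazards    = 0.20
control_hazards = 0.15

cpi = ideal_cpi + structural + data_hazards + control_hazards
print("CPI =", cpi)                                   # 1.4
print("performance lost vs. ideal: %.2fx" % (cpi / ideal_cpi))
```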

19
Solutions
  • Structural hazard solutions:
  • Have as many functional units as needed
  • Data hazard solutions:
  • Execute instructions in order; use a scoreboard to eliminate data hazards by stalling instructions
  • Execute instructions out of order, as soon as operands are available, but graduate them in order
  • Use register renaming to avoid WAR and WAW data hazards
  • Control hazard solutions:
  • Use branch prediction
  • Make sure that the branch is resolved before registers are modified

20
ILP Architecture
21
ILP Architectures
  • Computer Architecture is a contract (instruction
    format and the interpretation of the bits that
    constitute an instruction) between the class of
    programs that are written for the architecture
    and the set of processor implementations of that
    architecture.
  • In ILP architectures, information about the parallelism available among the instructions and operations is embedded in the program

22
ILP Architectures Classifications
  • Sequential architectures: the program is not expected to convey any explicit information regarding parallelism (superscalar processors)
  • Dependence architectures: the program explicitly indicates the dependences that exist between operations (dataflow processors)
  • Independence architectures: the program provides information as to which operations are independent of one another (VLIW processors)

23
Sequential Architecture and Superscalar Processors
  • Program contains no explicit information
    regarding dependencies that exist between
    instructions
  • Dependencies between instructions must be
    determined by the hardware
  • The compiler may re-order instructions to facilitate the hardware's task of extracting parallelism

24
Superscalar Processors
  • Superscalar processors attempt to issue multiple
    instructions per cycle
  • Essential dependencies are specified by sequential ordering, so operations must be processed in sequential order
  • This can be a performance bottleneck

25
Dependence architecture and Dataflow Processors
  • The compiler (programmer) identifies the
    parallelism in the program and communicates it to
    the hardware (specify the dependences between
    operations)
  • The hardware determines at run time when each operation is independent from others and performs the scheduling
  • Objective: execute each instruction at the earliest possible time (once input operands and functional units are available)

26
Dataflow Processors
  • Dataflow processors are representatives of
    Dependence architectures
  • Execute instruction at earliest possible time
    subject to availability of input operands and
    functional units
  • Dependencies communicated by providing with each
    instruction a list of all successor instructions
  • As soon as all input operands of an instruction
    are available, the hardware fetches the
    instruction
  • Few Dataflow processors currently exist

27
Independence Architecture and VLIW Processors
  • By knowing which operations are independent, the
    hardware needs no further checking to determine
    which instructions can be issued in the same
    cycle
  • The set of independent operations >> the set of dependent operations
  • Only a subset of the independent operations is specified
  • The compiler may additionally specify on which
    functional unit and in which cycle an operation
    is executed
  • The hardware needs to make no run-time decisions

28
VLIW processors
  • Operation versus instruction:
  • Operation: a unit of computation (add, load, branch; an instruction in a sequential architecture)
  • Instruction: a set of operations that are intended to be issued simultaneously
  • The compiler decides which operations go into each instruction (scheduling)
  • All operations that are supposed to begin at the same time are packaged into a single VLIW instruction, as in the sketch below
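A minimal sketch of the packing idea (assumed single-cycle operations and a made-up dependence graph; this is not the slides' scheduling algorithm): repeatedly gather operations whose predecessors have finished and bundle up to the machine width into one VLIW instruction.

```python
# Minimal VLIW packing sketch: greedily bundle operations whose predecessors
# have already completed, up to `width` per VLIW instruction.
# All operations are assumed to take one cycle.

def pack_vliw(ops, deps, width=2):
    """ops: list of names; deps: dict op -> set of ops it depends on."""
    done, bundles = set(), []
    remaining = list(ops)
    while remaining:
        ready = [o for o in remaining if deps.get(o, set()) <= done]
        bundle = ready[:width]                 # fill at most `width` slots
        if not bundle:
            raise ValueError("cyclic dependences")
        bundles.append(bundle)
        done |= set(bundle)
        remaining = [o for o in remaining if o not in done]
    return bundles

ops  = ["v", "w", "x", "y", "z"]
deps = {"w": {"v"}, "x": {"w"}, "z": {"y"}}    # assumed dependence graph
print(pack_vliw(ops, deps))
# [['v', 'y'], ['w', 'z'], ['x']]  -- three VLIW instructions, two slots each
```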

29
VLIW strengths
  • In hardware it is very simple
  • consisting of a collection of function units
    (adders, multipliers, etc.) connected by a bus,
    plus some registers and caches
  • More silicon goes to the actual processing
    (rather than being spent on branch prediction,
    for example)
  • It should run fast, as the only limit is the
    latency of the function units themselves.
  • Programming a VLIW chip is very much like writing
    microcode

30
VLIW limitations
  • The need for a powerful compiler
  • Increased code size arising from aggressive
    scheduling policies
  • Larger memory bandwidth and register-file
    bandwidth
  • Limitations due to lock-step operation; binary compatibility across implementations with varying numbers of functional units and latencies

31
Summary ILP Architectures
|  | Sequential Architecture | Dependence Architecture | Independence Architecture |
| Additional info required in the program | None | Specification of dependences between operations | A list of independences; a complete specification of when and where each operation is to be executed |
| Typical kind of ILP processor | Superscalar | Dataflow | VLIW |
| Dependence analysis | Performed by HW | Performed by compiler | Performed by compiler |
| Scheduling | Performed by HW | Performed by HW | Performed by compiler |
32
ILP Scheduling
  • Static scheduling boosted by parallel code optimization
    • done by the compiler
    • The processor receives dependency-free and optimized code for parallel execution
    • Typical for VLIWs and a few pipelined processors (e.g. MIPS)
  • Dynamic scheduling without static parallel code optimization
    • done by the processor
    • The code is not optimized for parallel execution; the processor detects and resolves dependencies on its own
    • Early ILP processors (e.g. CDC 6600, IBM 360/91)
  • Dynamic scheduling boosted by static parallel code optimization
    • done by the processor in conjunction with a parallel optimizing compiler
    • The processor receives code optimized for parallel execution, but it detects and resolves dependencies on its own
    • Usual practice for pipelined and superscalar processors (e.g. RS6000)

33
Superscalar
34
What is Superscalar?
  • A machine designed to improve the performance of the execution of scalar instructions, where the baseline is one instruction per cycle
  • A superscalar architecture allows several instructions to be issued and completed per clock cycle
  • It consists of a number of pipelines that work in parallel
  • Common instructions (arithmetic, load/store, conditional branch) can be initiated and executed independently in different pipelines; they may be executed in an order different from the program order
  • Equally applicable to RISC and CISC; in practice usually RISC

35
Pipelined Execution
36
Superscalar Execution
37
How Does it Work?
  • Instruction fetch
  • fetching of multiple instructions at once
  • dynamic branch prediction and fetching beyond branches
  • Instruction issue
  • methods for determining which instructions can be issued
  • the ability to issue multiple instructions in parallel
  • Instruction commit
  • methods for committing several instructions in fetch order
  • requires duplicated, more complex hardware

38
Superscalar Execution Example
Data Flow
  • Assumptions:
  • A single FP adder takes 2 cycles
  • A single FP multiplier takes 5 cycles
  • Can issue an add and a multiply together
  • Must issue in order

v: addt f10, f2, f4
w: mult f10, f10, f6
x: addt f12, f10, f8
y: addt f4, f4, f6
z: addt f10, f4, f8

Critical path: 9 cycles (single adder, data dependence)
In-order issue: 13 cycles
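The 9-cycle critical path can be checked mechanically. Below is a small sketch (my own code; the instruction encoding is an assumption) that computes the longest latency-weighted chain of RAW dependences using the 2-cycle add and 5-cycle multiply latencies above.

```python
# Compute the dataflow critical path: earliest finish time of each instruction
# given RAW dependences and functional-unit latencies (add = 2, mul = 5).
latency = {"addt": 2, "mult": 5}
program = [                      # (name, op, dest, src1, src2)
    ("v", "addt", "f10", "f2",  "f4"),
    ("w", "mult", "f10", "f10", "f6"),
    ("x", "addt", "f12", "f10", "f8"),
    ("y", "addt", "f4",  "f4",  "f6"),
    ("z", "addt", "f10", "f4",  "f8"),
]

finish, writer = {}, {}          # finish time per instruction; last writer per reg
for name, op, dst, s1, s2 in program:
    ready = max((finish[writer[r]] for r in (s1, s2) if r in writer), default=0)
    finish[name] = ready + latency[op]
    writer[dst] = name           # later reads of dst depend on this instruction

print(finish)                    # {'v': 2, 'w': 7, 'x': 9, 'y': 2, 'z': 4}
print("critical path:", max(finish.values()), "cycles")   # 9
```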
39
Out of Order Issue
  • Can start y as soon as an adder is available
  • Must hold back z until f10 is no longer in use and an adder is available

v: addt f10, f2, f4
w: mult f10, f10, f6
x: addt f12, f10, f8
y: addt f4, f4, f6
z: addt f10, f4, f8

Critical path: 9 cycles
Out-of-order issue: 11 cycles
40
With Register Renaming
v: addt f10a, f2, f4
w: mult f10a, f10a, f6
x: addt f12, f10a, f8
y: addt f4, f4, f6
z: addt f10, f4, f8

Critical path: 9 cycles
With renaming: 9 cycles
41
Instruction Issue Policy
  • The instruction issue policy refers to the protocol used to issue instructions
  • The three types of ordering are:
  • the order in which instructions are fetched
  • the order in which instructions are executed
  • the order in which instructions change registers and memory

42
Instruction Issue Policy
  • The simplest policy is to execute and complete instructions in their sequential order
  • To improve parallelism, the processor has to look ahead and try to find independent instructions to execute in parallel
  • Execution policies:
  • i. In-order issue with in-order completion
  • ii. In-order issue with out-of-order completion
  • iii. Out-of-order issue with out-of-order completion

43
In-Order Issue with In-Order Completion
  • Instructions are issued in the exact order that would correspond to sequential execution (in-order issue), and results are written in that same order (in-order completion)

44
In-Order Issue with Out-of-Order Completion
  • Results are written in a different order
  • Output dependency example:
  • R3 = R3 + R5 (I1)
  • R4 = R3 + 1 (I2)
  • R3 = R5 + 1 (I3)
  • R7 = R3 + R4 (I4)
  • If I3 completes before I1, the value left in R3 will be wrong; but with register renaming, I3 can complete out of order

45
Out-of-Order Issue with Out-of-Order Completion
  • With in-order issue, no new instruction can be
    issued when processor has detected a conflict and
    is stalled, until after the conflict has been
    resolved
  • The processor is not allowed to look ahead for
    further instructions, which could be executed in
    parallel
  • Out-of-order issue tries to resolve the above
    problem by taking a set of decoded instructions
    into an instruction window (buffer)
  • When a functional unit becomes available, an
    instruction from the window may be issued to the
    execute stage
  • Any instruction may be issued, provided that:
  • the particular functional unit it needs is available, and
  • no conflicts or dependencies block the instruction

46
Execution Example
  • Assumptions:
  • Two-way issue with renaming; rename registers B1, B2, etc.; 1-cycle ADD.D latency, 2-cycle MUL.D latency

v: ADD.D f10, f2, f4
w: MUL.D f10, f10, f6
x: ADD.D f12, f10, f8
y: ADD.D f4, f4, f6
47
Cycle 1
  • v and w issued
  • v and w targets set to B1 and B2

48
Cycle 2
  • x and y issued
  • v w targets set to B1 B2

49
Cycle 3
  • Instruction v retired, but it does not change f10
  • Instruction w begins execution and moves through the 2-stage pipeline
  • Instruction y executed

50
Cycle 4
  • Instruction w finishes execution
  • Instruction y cannot be retired yet

51
Cycle 5
  • Instruction w retired, update f10
  • Instruction y cannot be retired yet
  • Instruction x executed

52
Cycle 6
  • Instructions x, y retired, update f12, f4

53
Example MIPS R10000
  • Can decode 4 instructions per cycle
  • Has 5 execution pipelines
  • Uses dynamic scheduling and out-of-order
    execution
  • Does speculative branching
  • Functional Units
  • Integer ALU1
  • Integer ALU2
  • Load/Store Unit
  • Float Adder
  • Float Multiply

54
7 Pipeline Stages
Stage 1: Fetch
Stage 2: Decode
Stage 3: Issue
Stage 4: Execute
Stage 5: Execute
Stage 6: Execute
Stage 7: Store
[Diagram: the 5 execution pipelines. FP adder: Issue, FAdd-1, FAdd-2, FAdd-3, Result, RF. FP multiplier: Issue, FMpy-1, FMpy-2, FMpy-3, Result, RF. Integer ALU1: Issue, ALU1, Result, RF. Integer ALU2: Issue, ALU2, Result, RF. Load/Store: Issue, Add-Calc, Data Cache, Result, RF. Instructions flow from the instruction cache through fetch and decode (4 instructions per cycle) and the decode/branch unit (one branch can be handled every cycle) into queues feeding the functional units.]
55
Pros and Cons
  • Pros
  • The hardware solves everything
  • Hardware detects potential parallelism between
    instructions
  • Hardware tries to issue as many instructions as
    possible in parallel.
  • Hardware solves register renaming.
  • Cons
  • Very complex
  • Much hardware is needed for run-time detection.
    There is a limit in how far we can go with this
    technique.
  • Power consumption can be very large!
  • The window of execution is limited, which limits the capacity to detect potentially parallel instructions

56
Multithreading
57
Multithreaded Processors
  • Several register sets
  • Fast Context Switching

[Diagram: four register sets (Register set 1-4), one per thread (Thread 1-4), enabling fast context switching]
58
Execution in Multithreaded Processors
  • Cycle-by-cycle interleaving
  • Block interleaving
  • Simultaneous multithreading

59
Multithreading Techniques
Multithreading
  • Cycle-by-cycle interleaving
  • Block interleaving
    • Static
      • Explicit switch
      • Implicit switch (switch-on-load, switch-on-store, switch-on-branch, ...)
    • Dynamic
      • Switch-on-cache-miss
      • Switch-on-signal
      • Switch-on-use
      • Conditional switch
Source: Jurij Silc
60
Multithreading on Scalar
[Diagram: instruction slots on a scalar processor under single-threaded execution, cycle-by-cycle interleaving, and block interleaving, with context switching between threads]
61
Single Threaded CPU
  • The different colored boxes in RAM represent instructions from four different running programs
  • Only the instructions of the red program are actually being executed right now
  • This CPU can issue up to four instructions per clock cycle to the execution core, but as you can see it never actually reaches this four-instruction limit

62
Single Threaded SMP
The red and yellow processes happen to be executing simultaneously, one on each processor. Once their respective time slices are up, their contexts will be saved, their code and data will be flushed from the CPU, and two new processes will be prepared for execution.
63
Multithreaded Processors
If the red thread requests data from main memory
and this data isn't present in the cache, then
this thread could stall for many CPU cycles while
waiting for the data to arrive. In the meantime,
however, the processor could execute the yellow
thread while the red one is stalled, thereby
keeping the pipeline full and getting useful work
out of what would otherwise be dead cycles
64
Simultaneous Multithreading (SMT)
SMT is simply Multithreading without the
restriction that all the instructions issued by
the front end on each clock be from the same
thread
65
Contents
  • Four Eras of Computing
  • Parallelism
  • Main Parallel Architecture
  • Interconnection Networks
  • Static IN
  • Dynamic IN
  • Performance Evaluation
  • Parallel Programming

66
  • Four Eras of Computing

67
Four Eras of Computing
| Feature | Batch | Time-Sharing | Desktop | Network |
| Time period | 1960s | 1970s | 1980s | 1990s/2000s |
| Location | Computer room | Terminal room | Desktop | Mobile |
| Users | Experts | Specialists | Individuals | Groups |
| Data | Alphanumeric | Text, numbers | Fonts, graphs | Multimedia |
| Objective | Calculate | Access | Present | Communicate |
| Interface | Punched cards | Keyboard, CRT | See and point | Ask and tell |
| Operation | Process | Edit | Layout | Orchestrate |
| Connectivity | None | Peripheral cable | LAN | Internet/Wireless |
| Owners | Corporate computer centers | Divisional IS shops | Departments/individuals | Everyone |
68
The Computer Evolution
69
Moore's Law
  • A 1965 prediction by Intel co-founder Gordon Moore:
  • The number of transistors that can be built on the same size piece of silicon will double every 18 months

70
(No Transcript)
71
Processor Evolution
[Diagram: scaling from Generation N to Generation N+1]
  • Gate delay reduces by 1/√2 (frequency up by √2)
  • Number of transistors in a constant area goes up by 2x
  • Additional transistors enable an additional increase in performance
  • Result: 2x performance at roughly equal cost

72
Can this Growth be sustained forever?
  • Speed of Light Argument
  • (Most people)
  • The Vanishing Electrons Argument
  • (Joel Birnbaum, HP Labs, ACM 97)
  • The FM Radio Analogy
  • (Erik P. DeBenedictis, Sandia National Labs,
    2005)

73
Speed of Light Limit
Light travels 1 cm in about 1/30 of a nanosecond (roughly 33 picoseconds).
What is the maximum execution rate if a signal must travel 1 cm during the execution of an instruction?
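A short worked version of the argument, using the vacuum speed of light (on-chip signals are in fact slower, so the real bound is tighter):

```python
# If every instruction requires a signal to cross 1 cm, the speed of light
# bounds the instruction rate.
c_cm_per_s = 3.0e10                 # speed of light, ~3 x 10^10 cm/s
distance_cm = 1.0

t = distance_cm / c_cm_per_s        # time for the signal to travel 1 cm
print("travel time: %.1f ps" % (t * 1e12))                        # ~33.3 ps
print("max rate: %.1f billion instructions/s" % (1 / t / 1e9))    # ~30
```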
74
Joel Birnbaum, HP Labs, 1997
75
FM Radio and the End of Moore's Law (Erik P. DeBenedictis, Sandia National Labs, 5/16/05)
Distance: driving away from an FM transmitter → less signal; noise from electrons → no change
Shrink: increasing numbers of gates → less signal power; noise from electrons → no change
76
We need more Computing Power
  • Parallelism is an obvious answer!!

Example applications: vision, human genome, climate modeling, ocean circulation, fluid turbulence, viscous flow, quantum chromodynamics, superconductor modeling, vehicle dynamics, weather prediction, chemical dynamics, 3D plasma, oil reservoir modeling.
Parallelism: multiple processors cooperate to jointly execute a single computational task in order to speed up its execution.
77
Parallelism
  • Multiple processors cooperate to jointly execute
    a single computational task in order to speed up
    its execution.
  • Solve Problems Faster (Speedup)
  • Solve More Problems (Higher Throughput)
  • Solve Larger Problems (Computational Power)
  • Enhance Solutions Quality (Quality Up)

78
Types of Parallelism
|  | Single Data Stream | Multiple Data Streams |
| Single Instruction Stream | SISD (uniprocessors) | SIMD (array processors, vector processors) |
| Multiple Instruction Streams | MISD | MIMD (multiprocessors, multicomputers) |
Flynn's Taxonomy
79
MIMD Categories
|  | Shared Variables | Message Passing |
| Global Memory | GMSV (shared-memory multiprocessors) | GMMP |
| Distributed Memory | DMSV (distributed shared memory) | DMMP (distributed-memory multicomputers) |
Johnson's Expansion
80
  • Main Parallel Architecture

81
SIMD Systems
One control unit; lockstep operation; all processors do the same thing or nothing.
[Diagram: a von Neumann computer acting as control unit, broadcasting to processors connected by some interconnection network]
82
MIMD Shared Memory Systems
One global memory; cache coherence; all processors have equal access to memory.
83
Cache Coherent NUMA
Each processor holds part of the shared memory; non-uniform memory access.
84
MIMD Distributed Memory Systems
No shared memory; communication by message passing; the network topology matters.
85
Parallel and Distributed Architecture (Leopold,
2001)
[Diagram: SIMD, SMP, CC-NUMA, DMPC, Cluster, and Grid arranged along three axes: degree of coupling (tight to loose), supported grain sizes (fine to coarse), and communication speed (fast to slow); SIMD and SMP sit at the tight/fine/fast end, clusters and grids at the loose/coarse/slow end]
86
Main Components
  • Processors
  • Memory Modules
  • Interconnection Network

87
Interconnection Network Taxonomy
Interconnection Network
  • Static: 1-D, 2-D, HC (hypercube)
  • Dynamic:
    • Bus-based: single, multiple
    • Switch-based: SS (single-stage), MS (multistage), Crossbar
88
Static IN
89
Static Interconnection Networks
  • Static (fixed) interconnection networks are
    characterized by having fixed paths,
    unidirectional or bi-directional, between
    processors.
  • Completely connected networks (CCNs): number of links O(N^2), delay complexity O(1)
  • Limited connection networks (LCNs):
  • Linear arrays
  • Ring (Loop) networks
  • Two-dimensional arrays
  • Tree networks
  • Cube network

90
Static Network Analysis
  • Graph Representation
  • Parameters
  • Cost
  • Degree
  • Diameter
  • Fault tolerance

91
Graph Review
  • G = (V, E) -- V: nodes, E: edges
  • Directed vs. undirected
  • Weighted graphs
  • Path, path length, shortest path
  • Cycles, cyclic vs. acyclic
  • Connectivity: connected, weakly connected, strongly connected, fully connected

92
Linear Array
N nodes, N-1 edges
Node Degree
Diameter
Cost
Fault Tolerance
93
Ring
N nodes, N edges
Node Degree
Diameter
Cost
Fault Tolerance
94
Chordal Ring
N nodes, N edges
Node Degree
Diameter
Cost
Fault Tolerance
95
Barrel Shifter
  • Number of nodes N = 2^n
  • Start with a ring
  • Add extra edges from each node to the nodes at a power-of-2 distance
  • i and j are connected if |j - i| = 2^r, for r = 0, 1, 2, ..., n-1 (see the sketch below)
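A small sketch (my own code) that generates the barrel-shifter links for N = 2^n nodes from the rule above:

```python
# Sketch: generate the edges of a barrel shifter with N = 2**n nodes.
# Nodes i and j are linked when their distance around the ring is a power of 2.
def barrel_shifter_edges(n):
    N = 1 << n
    edges = set()
    for i in range(N):
        for r in range(n):
            j = (i + (1 << r)) % N          # neighbor at distance 2**r
            edges.add((min(i, j), max(i, j)))
    return sorted(edges)

edges = barrel_shifter_edges(3)             # 8-node barrel shifter
print(len(edges), "edges")                  # 20 edges for N = 8
print(edges[:6])
```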

96
Mesh and Torus
Mesh: N = n x n; node degree = 4 internally, 3 or 2 on the boundary; diameter = 2(n-1)
Torus: node degree = 4; diameter = 2 floor(n/2)
97
Hypercubes
  • N = 2^d
  • d dimensions (d = log2 N)
  • A cube with d dimensions is made out of 2 cubes of dimension d-1
  • Symmetric
  • Degree, diameter, cost, fault tolerance
  • Node labeling: number of bits = d

98
Hypercubes
[Figure: hypercubes of dimension d = 0, 1, 2, 3]
99
Hypercubes
100
Hypercube of dimension d
N = 2^d, so d = log2 N
Node degree = d
Number of bits to label a node = d
Diameter = d
Number of edges = Nd/2
Routing: the Hamming distance between source and destination labels gives the number of hops; fix one differing bit per hop (see the sketch below)
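A minimal sketch of dimension-order (e-cube) routing that follows directly from the Hamming-distance observation; the function names are mine.

```python
# Sketch of dimension-order (e-cube) routing in a hypercube: repeatedly flip
# the lowest differing address bit until source equals destination. The number
# of hops equals the Hamming distance between the two labels.
def hypercube_route(src, dst, d):
    path = [src]
    cur = src
    for bit in range(d):
        if (cur ^ dst) & (1 << bit):     # this dimension still differs
            cur ^= (1 << bit)            # traverse the link in dimension `bit`
            path.append(cur)
    return path

d = 4                                     # 16-node hypercube
src, dst = 0b0101, 0b1100
path = hypercube_route(src, dst, d)
print([format(p, "04b") for p in path])   # ['0101', '0100', '1100']
print("hops:", len(path) - 1, "= Hamming distance:", bin(src ^ dst).count("1"))
```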
101
Subcubes and Cube Fragmentation
  • What is a subcube?
  • Shared Environment
  • Fragmentation Problem
  • Is it Similar to something you know?

102
Cube Connected Cycles (CCC)
  • k-cube → 2^k nodes
  • k-CCC: from a k-cube, replace each vertex of the k-cube with a ring of k nodes
  • k-CCC → k x 2^k nodes
  • Degree, diameter → 3, 2k
  • Try it for the 3-cube

103
K-ary n-Cube
  • d = cube dimension
  • k = number of nodes along each dimension
  • N = k^d
  • Wraparound connections
  • Hypercube → binary d-cube
  • Torus → k-ary 2-cube

104
Analysis and performance metrics: static networks
| Network | Degree (d) | Diameter (D) | Cost | Symmetry | Worst delay |
| CCNs | N-1 | 1 | N(N-1)/2 | Yes | 1 |
| Linear Array | 2 | N-1 | N-1 | No | N |
| Binary Tree | 3 | 2(ceil(log2 N) - 1) | N-1 | No | log2 N |
| n-cube | log2 N | log2 N | nN/2 | Yes | log2 N |
| 2D-Mesh | 4 | 2(n-1) | 2(N-n) | No | sqrt(N) |
| k-ary n-cube | 2n | n*floor(k/2) | nN | Yes | k x log2 N |
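A few of the table's entries can be spot-checked by brute force. The sketch below (my own code, assuming N = 16) builds a linear array, a ring, and a 4-cube as adjacency lists and measures their diameters with BFS.

```python
# Spot-check static-network metrics by BFS (diameter = longest shortest path).
from collections import deque

def diameter(adj):
    def bfs(s):
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        return max(dist.values())
    return max(bfs(s) for s in adj)

N = 16
linear = {i: [j for j in (i - 1, i + 1) if 0 <= j < N] for i in range(N)}
ring   = {i: [(i - 1) % N, (i + 1) % N] for i in range(N)}
cube   = {i: [i ^ (1 << b) for b in range(4)] for i in range(N)}   # 4-cube

print("linear array:", diameter(linear))   # N-1 = 15
print("ring:        ", diameter(ring))     # floor(N/2) = 8
print("hypercube:   ", diameter(cube))     # log2 N = 4
```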
105
Dynamic IN
106
Bus Based IN
[Diagram: single-bus and multiple-bus systems connecting processors to a global memory]
107
Dynamic Interconnection Networks
  • Communication patterns are based on program
    demands
  • Connections are established on the fly during
    program execution
  • Multistage Interconnection Network (MIN) and
    Crossbar

108
Switch Modules
  • An A x B switch module has A inputs and B outputs
  • In practice, A = B = a power of 2
  • Each input is connected to one or more outputs (conflicts must be avoided)
  • One-to-one (permutation) and one-to-many connections are allowed

109
Binary Switch
Legitimate states = 4
Permutation connections = 2
110
Legitimate Connections
111
Group Work
General Case ??
112
Multistage Interconnection Networks
ISC = Inter-stage Connection pattern
[Diagram: a multistage network built from alternating columns of switches and inter-stage connection patterns ISC1, ISC2, ..., ISCn]
113
Perfect-Shuffle Routing Function
  • Given x = an an-1 ... a2 a1
  • P(x) = an-1 ... a2 a1 an (cyclic left shift of the address bits)
  • x = 110001
  • P(x) = 100011
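A small sketch (my own code) of the perfect-shuffle function as a one-bit left rotation of the address; it reproduces the example above and the table on the next slide.

```python
# Perfect-shuffle routing function: rotate the n-bit address left by one.
def perfect_shuffle(x, n):
    msb = (x >> (n - 1)) & 1
    return ((x << 1) | msb) & ((1 << n) - 1)

print(format(perfect_shuffle(0b110001, 6), "06b"))   # 100011, as on the slide
for x in range(8):                                    # the 3-bit table (slide 114)
    print(format(x, "03b"), "->", format(perfect_shuffle(x, 3), "03b"))
```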

114
Perfect Shuffle Example
  • 000 → 000
  • 001 → 010
  • 010 → 100
  • 011 → 110
  • 100 → 001
  • 101 → 011
  • 110 → 101
  • 111 → 111

115
Perfect-Shuffle
[Diagram: perfect-shuffle connections from eight inputs 000-111 to eight outputs 000-111]
116
Exchange Routing Function
  • Given x = an an-1 ... ai ... a2 a1
  • Ei(x) = an an-1 ... ai' ... a2 a1 (bit ai complemented)
  • x = 0000000
  • E3(x) = 0000100
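The same idea for the exchange function, as a small sketch (my own code): complement bit i, counting the least significant bit as bit 1.

```python
# Exchange routing function E_i: complement bit i of the address (bit 1 = LSB).
def exchange(x, i):
    return x ^ (1 << (i - 1))

print(format(exchange(0b0000000, 3), "07b"))   # 0000100, as on the slide
for x in range(8):                             # E1 on 3-bit addresses (slide 117)
    print(format(x, "03b"), "->", format(exchange(x, 1), "03b"))
```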

117
Exchange E1
  • 000 → 001
  • 001 → 000
  • 010 → 011
  • 011 → 010
  • 100 → 101
  • 101 → 100
  • 110 → 111
  • 111 → 110

118
Exchange E1
[Diagram: E1 exchange connections between eight inputs 000-111 and eight outputs 000-111]
119
Butterfly Routing Function
  • Given x = an an-1 ... a2 a1
  • B(x) = a1 an-1 ... a2 an (most and least significant bits swapped)
  • x = 010001
  • B(x) = 110000
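And a small sketch (my own code) of the butterfly function: swap the most and least significant address bits.

```python
# Butterfly routing function: swap the most and least significant address bits.
def butterfly(x, n):
    msb, lsb = (x >> (n - 1)) & 1, x & 1
    x &= ~((1 << (n - 1)) | 1)          # clear both end bits
    return x | (lsb << (n - 1)) | msb   # reinsert them swapped

print(format(butterfly(0b010001, 6), "06b"))   # 110000, as on the slide
for x in range(8):                             # the 3-bit table (slide 120)
    print(format(x, "03b"), "->", format(butterfly(x, 3), "03b"))
```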

120
Butterfly Example
  • 000 → 000
  • 001 → 100
  • 010 → 010
  • 011 → 110
  • 100 → 001
  • 101 → 101
  • 110 → 011
  • 111 → 111

121
Butterfly
[Diagram: butterfly connections between eight inputs 000-111 and eight outputs 000-111]
122
Multi-stage network
123
MIN (cont.)
An 8X8 Banyan network
124
MIN Implementation
Control (X)
Source (S)
Destination (D)
X = f(S, D)
125
Example

126
Consider this MIN
[Diagram: a three-stage MIN (stage 1, stage 2, stage 3) connecting sources S1-S8 to destinations D1-D8]
127
Example (Cont.)
  • Let the control variables be X1, X2, X3
  • Find the values of X1, X2, X3 that connect:
  • S1 → D6
  • S7 → D5
  • S4 → D1

128
The 3 connections
[Diagram: the same three-stage MIN with the three requested connections (S1 to D6, S7 to D5, S4 to D1) highlighted]
129
Boolean Functions
  • X = x1 x2 x3
  • S = s1 s2 s3
  • D = d1 d2 d3
  • Find X = f(S, D)
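The figure defining this particular MIN is not reproduced in the transcript, so the exact Boolean form of f(S, D) cannot be recovered here. As a hedged sketch of the general idea, the code below assumes a standard 8x8 Omega network (perfect-shuffle inter-stage connections, 2x2 switches) with destination-tag self-routing, where the switch setting at stage i depends only on destination bit d_i; the slides' network may use a different ISC pattern and therefore a different f.

```python
# Destination-tag routing in an assumed 8x8 Omega network (3 stages of 2x2
# switches, perfect-shuffle inter-stage pattern). At stage i the switch sends
# the message to the output selected by destination bit d_i (MSB first);
# the switch is set to "cross" when that bit differs from the current LSB.
def omega_route(src, dst, n=3):
    pos, controls = src, []
    for stage in range(n):
        pos = ((pos << 1) | (pos >> (n - 1))) & ((1 << n) - 1)   # perfect shuffle
        want = (dst >> (n - 1 - stage)) & 1                      # destination bit
        controls.append("cross" if (pos & 1) != want else "straight")
        pos = (pos & ~1) | want                                  # leave on that port
    assert pos == dst
    return controls

# Example: the three requested connections, with inputs/outputs numbered 0-7
# here, whereas the slide numbers them S1-S8 / D1-D8.
for s, d in [(0, 5), (6, 4), (3, 0)]:
    print(f"{s:03b} -> {d:03b}:", omega_route(s, d))
```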

130
Crossbar Switch
131
Analysis and performance metrics: dynamic networks
| Network | Delay | Cost | Blocking | Degree of FT |
| Bus | O(N) | O(1) | Yes | 0 |
| Multiple-bus | O(mN) | O(m) | Yes | (m-1) |
| MIN | O(log N) | O(N log N) | Yes | 0 |
| Crossbar | O(1) | O(N^2) | No | 0 |