1
Scalar Operand Networks: On-Chip Interconnect for ILP in Partitioned Architectures
  • Michael Bedford Taylor, Walter Lee,
  • Saman Amarasinghe, Anant Agarwal

Laboratory for Computer Science, Massachusetts Institute of Technology
2
Motivation
As a thought experiment, let's examine the Itanium II, published in last year's ISSCC:
6-way issue integer unit: < 2% of die area
Cache logic: > 50% of die area
[Die photo: INT6 block vs. cache logic]
3
Hypothetical Modification
Why not replace a small portion of the cache with additional issue units? A 30-way issue micro! Integer units would still occupy less than 10% of the die area.
[Die photo: five INT6 blocks; cache logic still > 42% of die area]
4
Can monolithic structures like this be attained at high frequency?
The 6-way integer unit in the Itanium II already spends 50% of its critical path in bypassing. [ISSCC 2002, 25.6]
Even if dynamic logic or logarithmic circuits could be used to flatten the number of logic levels of these huge structures...
5
...wire delay is inescapable
[Figure: the die area reachable in one cycle shrinks sharply from 180 nm to 45 nm]
Ultimately, wire delay limits the scalability of
un-pipelined, high-frequency, centralized
structures.
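A back-of-the-envelope sketch of this effect (a toy Python model with assumed numbers, not figures from the talk): if repeated-wire delay per mm stays roughly flat across process generations while clock frequency climbs, the distance a signal can cover in one cycle shrinks relative to the die.

```python
# Toy model (assumed numbers, for illustration only): repeated-wire delay
# per mm stays roughly flat across process nodes while frequency climbs,
# so the distance a signal can travel in one cycle keeps shrinking.
WIRE_DELAY_PS_PER_MM = 100.0    # assumption, not a measured value
DIE_EDGE_MM = 20.0              # assumed die edge length

for freq_ghz in (0.5, 1.0, 2.0, 4.0):
    cycle_ps = 1000.0 / freq_ghz
    reach_mm = cycle_ps / WIRE_DELAY_PS_PER_MM   # one-cycle signal reach
    frac = min(1.0, reach_mm / DIE_EDGE_MM)
    print(f"{freq_ghz:3.1f} GHz: reach {reach_mm:5.1f} mm "
          f"= {100 * frac:5.1f}% of the die edge in one cycle")
```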
6
One solution: Chip multiprocessors
e.g., IBM's two-core Power4
Research and commercial multiprocessors have been designed to scale to 1000s of ALUs. These multiprocessors scale because they don't have any centralized resources.
7
Multiprocessors: Not Quite Appropriate for ILP
  • High cost of inter-node operand routing

10s to 100s of cycles to transfer the output of one instruction to the input of an instruction on another node.
The vast difference between local and remote communication costs (~30x)...
...forces programmers and compilers to use entirely different algorithms at the two levels.
8
An alternative to a CMP: a distributed microprocessor design
Such a microprocessor would distribute resources to varying degrees:
Partitioned register files,
Partitioned ALU clusters,
Banked caches,
Multiple independent compute pipelines,
... even multiple program counters
9
Some distributed microprocessor designs
Conventional: Alpha 21264 integer clusters
Radical proposals: UT Austin's Grid, Wisconsin's ILDP and Multiscalar, MIT's Raw and Scale, Dynamic Dataflow, TTA, Stanford Smart Memories...
10
Some distributed microprocessor designs
An interesting secondary development: the centralized bypass network is being replaced by a more general, distributed interconnection network!
11
Artist's View
[Figure: a dataflow graph (ld a, ld b, >> 3, st b) mapped across distributed resources, joined by a sophisticated interconnect]
12
How are these networks different than existing networks?
They are designed to join operands and operations in space:
  Route scalar values, not multi-word packets
  Ultra-low latency
  Ultra-low occupancy
  Unstructured communication patterns
In this paper, we call these networks scalar operand networks, whether centralized or distributed.
13
What can we do to gain insight about scalar operand networks?
Looking at existing systems and proposals:
→ Try to figure out what's hard about these networks
Find a way to classify them
Gain a quantitative understanding
14
5 Challenges for Scalar Operand Networks
Delay Scalability - ability of a design to
maintain high frequencies as that design scales
15
Challenge 1: Delay Scalability
Intra-component → Structures that grow as the system scales become bottlenecked by both interconnect delay and logic depth:
Register Files, Memories, Selection Logic, Wakeup Logic, ...
16
Challenge 1: Delay Scalability
Intra-component → Structures that grow as the system scales become bottlenecked by both interconnect delay and logic depth:
Register Files, Memories, Selection Logic, Wakeup Logic, ...
  • Solution: Pipeline the structure
  • Turn propagation delay into pipeline latency (see the sketch below)
  • Example: Pentium 4 pipelines regfile access
  • Solution: Tile
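A minimal sketch of "turn propagation delay into pipeline latency", assuming a hypothetical register file whose reads take two cycles but which accepts a new read every cycle, so throughput survives even though latency grows:

```python
from collections import deque

class PipelinedRegfile:
    """Hypothetical 2-stage register file: each read takes 2 cycles of
    latency, but a new read can be issued every cycle (throughput kept)."""
    LATENCY = 2

    def __init__(self, values):
        self.values = values
        self.in_flight = deque()   # (ready_cycle, register, value)
        self.cycle = 0

    def issue_read(self, reg):
        self.in_flight.append((self.cycle + self.LATENCY, reg, self.values[reg]))

    def tick(self):
        self.cycle += 1
        done = [x for x in self.in_flight if x[0] <= self.cycle]
        for x in done:
            self.in_flight.remove(x)
        return done              # reads whose results are now available

rf = PipelinedRegfile({"r1": 7, "r2": 9})
rf.issue_read("r1"); rf.tick()           # cycle 1: r1 still in flight
rf.issue_read("r2"); print(rf.tick())    # cycle 2: r1's read completes
print(rf.tick())                          # cycle 3: r2's read completes
```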

17
Challenge 1: Delay Scalability
Intra-component
Inter-component → the problem of wire delay between components. Occurs because it can take many cycles for remote components to communicate.
18
Challenge 1: Delay Scalability
Intra-component
Inter-component → the problem of wire delay between components. Occurs because it can take many cycles for remote components to communicate.
→ Solution: Decentralize
Each component must operate with only partial knowledge. Assign a time cost for the transfer of non-local information.
Examples of non-local information: ALU outputs, stall signals, branch mispredicts, exceptions, memory dependence info
Examples of designs: Pentium 4 wires, 21264 integer clusters
19
5 Challenges for Scalar Operand Networks
Delay Scalability - ability of a design to scale while maintaining high frequencies
Bandwidth Scalability - ability of a design to scale without inordinately increasing the relative percentage of resources dedicated to interconnect
20
Challenge 2: Bandwidth Scalability
Global broadcasts don't scale. Examples: snoopy caches, superscalar result buses.
→ Problem: each node has to process incoming data proportional to the total number of nodes in the system.
21
Challenge 2: Bandwidth Scalability
Global broadcasts don't scale. Examples: snoopy caches, superscalar result buses.
→ Problem: each node has to process incoming data proportional to the total number of nodes in the system.
The delay can be pipelined, à la the Alpha 21264, but each node still has to process too many incoming requests each cycle. Imagine a 30-way issue superscalar where each ALU has its own register file copy: 30 writes per cycle!
22
Challenge 2: Bandwidth Scalability
Global broadcasts don't scale. Examples: snoopy caches, superscalar result buses.
→ Problem: each node has to process incoming data proportional to the total number of nodes in the system. The delay can be pipelined, à la the Alpha 21264, but each node still has to process too many incoming requests.
→ Solution: switch to a directory scheme
Replace the bus with a point-to-point network
Replace broadcast with unicast or multicast
Decimate the bandwidth requirement
23
Challenge 2: Bandwidth Scalability
A directory scheme for ILP?!!! Isn't that expensive?
Directories store dependence information - in other words, the locations where an instruction should send its result.
Fixed Assignment Architecture:
→ Assign each static instruction to an ALU at compile time
→ Compile the dependent ALU locations in with the instructions
→ The directory is looked up locally when the instruction is fetched (see the sketch below).
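A sketch of this fixed-assignment "local directory" idea under stated assumptions: the compiler has already assigned each static instruction to an ALU and baked the (tile, operand-slot) destinations of its result into the instruction encoding, so the lookup at fetch is just reading a field. The Instr/Network classes below are illustrative, not Raw's actual encoding.

```python
# Hypothetical fixed-assignment encoding: the compiler bakes each static
# instruction's destination tiles into the instruction itself, so the
# "directory lookup" is just reading a field of the fetched instruction.
from dataclasses import dataclass, field

@dataclass
class Instr:
    op: str
    dests: list = field(default_factory=list)   # [(tile_id, operand_slot), ...]

class Network:
    def send(self, tile_id, slot, value):        # point-to-point, not broadcast
        print(f"route {value} -> tile {tile_id}, operand slot {slot}")

ALU_OPS = {"add": lambda a, b: a + b, "shr": lambda a, b: a >> b}

def execute(instr, operands, network):
    result = ALU_OPS[instr.op](*operands)
    for tile_id, slot in instr.dests:            # unicast/multicast to consumers
        network.send(tile_id, slot, result)
    return result

# 'add' was assigned to this ALU at compile time; its two consumers live
# on tiles 3 and 7 (illustrative numbers).
execute(Instr("add", dests=[(3, 0), (7, 1)]), (5, 6), Network())
```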
24
Challenge 2: Bandwidth Scalability
A directory scheme for ILP?!!! Isn't that expensive?
Directories store dependence information - in other words, the locations where an instruction should send its result.
Fixed Assignment Architecture:
→ Assign each static instruction to an ALU at compile time
→ Compile the dependent ALU locations in with the instructions
→ The directory is looked up locally when the instruction is fetched.
Dynamic Assignment Architecture:
→ Harder: somehow we have to figure out which ALU owns the dynamic instruction that we are sending to. A true directory lookup may be too expensive.
25
5 Challenges for Scalar Operand Networks
Delay Scalability - ability of a design to scale while maintaining high frequencies
Bandwidth Scalability - ability of a design to scale without inordinately increasing the relative percentage of resources dedicated to interconnect
Deadlock and Starvation - distributed systems need to worry about over-committing internal buffering (example: throttling in dynamic dataflow machines)
Exceptional Events - interrupts, branch mispredictions, exceptions
26
5 Challenges for Scalar Operand Networks
Delay Scalability - ability of a design to scale while maintaining high frequencies
Bandwidth Scalability - ability of a design to scale without inordinately increasing the relative percentage of resources dedicated to interconnect
Deadlock and Starvation
Exceptional Events
Efficient Operation-Operand Matching - Gather
operands and operations to meet at some point
in space to perform a dataflow computation
27
Challenge 5: Efficient Operation-Operand Matching
The rest of this talk!
If operation-operand matching is too expensive, there's little point to scaling. Since this is so important, let's try to come up with a figure of merit for a scalar operand network:
28
What can we do to gain insight about scalar operand networks?
Looking at existing systems and proposals:
Try to figure out what's hard about these networks
→ Find a way to classify the networks
Gain a quantitative understanding
29
Defining a figure of merit for operation-operand matching
5-tuple: <SO, SL, NHL, RL, RO>
  SO - Send Occupancy
  SL - Send Latency
  NHL - Network Hop Latency
  RL - Receive Latency
  RO - Receive Occupancy
Tip: the ordering follows the timing of a message from sender to receiver.
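One way to read the tuple: occupancies consume issue slots at the endpoints, latencies consume time in flight. A minimal sketch of the resulting end-to-end operand transport cost (a formulation assumed here for illustration):

```python
def operand_cost(so, sl, nhl, rl, ro, hops):
    """End-to-end cycles to transport one operand across `hops` hops:
    occupancies burn issue slots at the endpoints, latencies burn time
    in flight."""
    return so + sl + nhl * hops + rl + ro

# Example: Raw's scalar operand network <0,1,1,1,0> over 3 mesh hops
print(operand_cost(0, 1, 1, 1, 0, hops=3))   # -> 5 cycles
```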
30
The interesting region
conventional distributed multiprocessor: <10, 30, 5, 30, 40>
Superscalar: <0, 0, 0, 0, 0> (not scalable)
31
Raw: Experimental Vehicle
16 instructions per cycle (fp, int, br, ld/st, alu, ...)
No centralized resources
250 operand routes / cycle
Two applicable on-chip networks:
- message passing
- dedicated scalar operand network
Scalability story: tiles are registered on input - just add more tiles.
Simulations are for 64 tiles; the prototype has 16.
32
The interesting region
conventional distributed multiprocessor: <10, 30, 5, 30, 40>
Superscalar: <0, 0, 0, 0, 0> (not scalable)
33
Two points in the interesting region
conventional distributed multiprocessor: <10, 30, 5, 30, 40>
Raw / msg passing: <3, 2, 1, 1, 7>
Raw / scalar: <0, 1, 1, 1, 0>
Superscalar: <0, 0, 0, 0, 0> (not scalable)
34
Message Passing 5-tuple: <3, ...>
(Using Raw's on-chip message-passing network)
Three wasted cycles per send → Send Occupancy = 3
  sender: compute value / send header / send sequence / send value (the three "send message" instructions are pure overhead)
  receiver: ... use the value
35
Message Passing 5-tuple: <3, 2, ...>
Two cycles for the message to exit the processor → Send Latency = 2 (assumes an early commit point)
  sender: compute value / send header / send sequence / send value
  receiver: ... use the value
36
Message Passing 5-tuple: <3, 2, 1, ...>
Messages take one cycle per hop → Per-hop Latency = 1
  sender: compute value / send header / send sequence / send value
  receiver: ... use the value
37
Message Passing 5-tuple: <3, 2, 1, 1, ...>
One cycle for the message to enter the processor → Receive Latency = 1
  sender: compute value / send header / send sequence / send value
  receiver: ... use the value
38
Message Passing 5-tuple: <3, 2, 1, 1, 7>
Seven wasted cycles per receive → Receive Occupancy = 7 (minimum)
  sender: compute value / send header / send sequence / send value
  receiver: load tag / branch if set / demultiplex message / get sequence / compare / branch if not eq / ... use the value
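A schematic reconstruction of why the receive side is so expensive: software must pop, demultiplex, and order-check each arriving message before the operand is usable. Illustrative Python, not Raw's actual handler code; each step stands in for one or more of the receive-side instructions above.

```python
# Schematic software receive handler for a message-passing network.
# All of this is occupancy spent before the operand can actually be used.
pending = []   # messages parked for other channels / later sequence numbers

def defer(msg):   pending.append(msg)    # belongs to another logical channel
def reorder(msg): pending.append(msg)    # arrived out of order

def receive(port, expected_tag, expected_seq):
    msg = port.pop(0)                    # pull the incoming message
    if msg["tag"] != expected_tag:       # load tag, branch if set
        return defer(msg)                # demultiplex to another handler
    if msg["seq"] != expected_seq:       # get sequence, compare, branch
        return reorder(msg)
    return msg["value"]                  # finally usable by the consumer

port = [{"tag": 5, "seq": 0, "value": 42}]
print(receive(port, expected_tag=5, expected_seq=0))   # -> 42
```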
39
Raw's 5-tuple: <0, ...>
Zero wasted cycles per send → Send Occupancy = 0
  sender: compute, send value (one instruction)
  receiver: use the value
40
Raw's 5-tuple: <0, 1, ...>
One cycle for the message to exit the processor → Send Latency = 1
  sender: compute, send value
  receiver: use the value
41
Raw's 5-tuple: <0, 1, 1, ...>
Messages take one cycle per hop → Per-hop Latency = 1
  sender: compute, send value
  receiver: use the value
42
Raw's 5-tuple: <0, 1, 1, 1, ...>
One cycle for the message to enter the processor → Receive Latency = 1
  sender: compute, send value
  receiver: use the value
43
Raw's 5-tuple: <0, 1, 1, 1, 0>
No wasted cycles for receive → Receive Occupancy = 0
  sender: compute, send value
  receiver: use the value
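For contrast, a sketch of Raw-style register-mapped network ports, where an ordinary ALU instruction both computes and injects its result and the consumer names the input port as a source operand. The $csto/$csti names echo Raw's static-network ports, but the model itself is illustrative.

```python
# Simplified model of register-mapped network ports: writing the network
# output "register" injects the operand, reading the input "register"
# consumes it, so neither send nor receive costs an extra instruction.
class Tile:
    def __init__(self, regs, out_fifo=None, in_fifo=None):
        self.regs, self.out_fifo, self.in_fifo = regs, out_fifo, in_fifo

    def write(self, reg, value):
        if reg == "$csto":
            self.out_fifo.append(value)   # compute + send in one instruction
        else:
            self.regs[reg] = value

    def read(self, reg):
        return self.in_fifo.pop(0) if reg == "$csti" else self.regs[reg]

link = []                                  # one static-network link
producer = Tile({"r1": 5, "r2": 6}, out_fifo=link)
consumer = Tile({"r4": 10}, in_fifo=link)
producer.write("$csto", producer.read("r1") + producer.read("r2"))  # add $csto, r1, r2
print(consumer.read("$csti") * consumer.read("r4"))                 # mul ..., $csti, r4 -> 110
```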
44
Superscalar 5-tuple: <0, ...>
Zero wasted cycles per send → Send Occupancy = 0
  sender: compute, send value
  receiver: use the value
45
Superscalar 5-tuple: <0, 0, 0, 0, ...>
Zero cycles for all latencies → Send, Hop, Receive Latencies = 0
  sender: compute, send value
  receiver: use the value
46
Superscalar 5-tuple: <0, 0, 0, 0, 0>
No wasted cycles for receive → Receive Occupancy = 0
  sender: compute, send value
  receiver: use the value
47
Superscalar 5-tuple, late wakeup
The wakeup signal will usually have to be sent ahead of time. If it's not, then the 5-tuple could be <0, 0, 0, 1, 0>.
  sender: compute, send value (after wakeup, select)
  receiver: use the value
48
5-tuples of several architectures
Superscalar                            <0,  0,   0,  0,  0>
Message Passing                        <3,  2+c, 1,  1,  7> to <3, 3+c, 1, 1, 12>
Distributed Shared Memory (F/E bits)   <1, 14+c, 2, 14,  1>
Raw                                    <0,  1,   1,  1,  0>
ILDP                                   <0,  n,   0,  1,  0>  (n = 0, 2)
Grid                                   <0,  0, n/8,  0,  0>  (n = 0..8)
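A small usage sketch that turns these tuples into end-to-end costs under the additive cost model above (restated inline so the snippet is self-contained; the values of c, n, and the hop count are arbitrary assumptions for illustration):

```python
# Hypothetical end-to-end cost per the 5-tuple model: occupancy + latency.
cost = lambda so, sl, nhl, rl, ro, hops: so + sl + nhl * hops + rl + ro

c, n, hops = 2, 2, 4   # assumed commit latency, ILDP latency, hop count
tuples = {
    "superscalar":     (0, 0,      0,     0,  0),
    "message passing": (3, 2 + c,  1,     1,  7),
    "DSM (F/E bits)":  (1, 14 + c, 2,     14, 1),
    "Raw":             (0, 1,      1,     1,  0),
    "ILDP":            (0, n,      0,     1,  0),
    "Grid":            (0, 0,      n / 8, 0,  0),
}
for name, t in tuples.items():
    print(f"{name:16s} {cost(*t, hops):5.1f} cycles to transport one operand")
```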
49
What can we do to gain insight about scalar operand networks?
Looking at existing systems and proposals:
Try to figure out what's hard about these networks
Find a way to classify the systems
→ Gain a quantitative understanding
50
5-tuple Simulation Experiments
Raw: the actual scalar operand network
Raw Magic: a parameterized scalar operand network (sketched below)
- Each tile has FIFOs connected to every other tile.
- Allows us to vary latencies and measure contention.
<0,1,1,1,0> Raw
<0,1,1,1,0> Magic Network
...vary all 5 parameters...
<1,14,2,14,0> Magic Network, Shared Memory Costs
<3,3,1,1,12> Magic Network, Message Passing Costs
...and others
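A sketch of the "magic network" methodology under stated assumptions: every pair of tiles is joined by a private FIFO, so there is no contention, and the five parameters are charged as pure delay on each transfer; comparing this against the real network isolates the cost of contention.

```python
import heapq

class MagicNetwork:
    """Idealized all-to-all network: each (src, dst) pair has a private
    FIFO, so there is no contention; the 5-tuple is charged as pure delay."""
    def __init__(self, so, sl, nhl, rl, ro, hop_count):
        self.tuple5 = (so, sl, nhl, rl, ro)
        self.hop_count = hop_count      # e.g., Manhattan distance on the mesh
        self.fifos = {}                 # (src, dst) -> heap of (ready_time, value)

    def send(self, src, dst, value, now):
        so, sl, nhl, rl, ro = self.tuple5
        ready = now + so + sl + nhl * self.hop_count(src, dst) + rl + ro
        heapq.heappush(self.fifos.setdefault((src, dst), []), (ready, value))

    def recv(self, src, dst, now):
        q = self.fifos.get((src, dst), [])
        if q and q[0][0] <= now:
            return heapq.heappop(q)[1]
        return None                     # operand not yet available

# Example: mimic shared-memory costs <1,14,2,14,0> on an 8x8 mesh
manhattan = lambda a, b: abs(a[0] - b[0]) + abs(a[1] - b[1])
net = MagicNetwork(1, 14, 2, 14, 0, manhattan)
net.send((0, 0), (7, 7), 42, now=0)
print(net.recv((0, 0), (7, 7), now=60))   # ready at 1+14+2*14+14 = 57 -> 42
```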
51
Raw's Scalar Operand Network, i.e., <0,1,1,1,0>
Speedup versus 1 tile:

               2     4     8    16    32    64   tiles
cholesky      1.62  3.23  6.00  9.19 11.90 12.93
vpenta        1.71  3.11  6.09 12.13 24.17 44.87
mxm           1.93  3.73  6.21  8.90 14.84 20.47
fppp-kernel   1.51  3.34  5.72  6.14  5.99  6.54
sha           1.12  1.96  1.98  2.32  2.54  2.52
swim          1.60  2.62  4.69  8.30 17.09 28.89
jacobi        1.43  2.76  4.95  9.30 15.88 22.76
life          1.81  3.37  6.44 12.05 21.08 36.10
52
Impact of Receive Occupancy, 64 tiles, i.e., <0,1,1,1,n>
53
Impact of Receive Latency, 64 tiles, i.e., <0,0,0,n,0> magic
54
Impact of Contention, i.e., Magic <0,1,1,1,0> / Raw's <0,1,1,1,0>
55
Comparison of four scalar operand networks:

                           Superscalar               Raw                      Grid                         ILDP
5-Tuple                    <0,0,0,0,0>               <0,1,1,1,0>              <0,1,N/8,1,0>                <0,N,0,1,0>
Instr distribution         Dynamic Assignment        Compiler Assignment      Compiler Assignment          Dynamic Assignment
Operand Transport          Broadcast                 Point to Point           Point to Point               Broadcast
Message Demultiplex        Associative Instr Window  Compile Time Scheduling  Distributed Assoc. Window    F/E bits on dist. reg. files
Intranode instr. order     Runtime Ordering          Compile Time Ordering    Runtime Ordering             Compile Time Ordering
Free intranode bypassing   yes                       yes                      no                           yes
56
Open Scalar Operand Network Questions
  • Can we beat <0,1,1,1,0>?
  • How close can we get to <0,0,0,0,0>?
  • How do we prove that our <0,0,?,0,0> scalar operand network would have a high frequency?
57
Open Scalar Operand Network Questions
  • Can we build scalable dynamic-assignment architectures for load-balancing? Is there a penalty for the 5-tuple?
  • What is the impact of run-time vs. compile-time routing on the 5-tuple?
  • What are the benefits of heterogeneous scalar operand networks? For instance, a <0,1,2,1,0> network of 2-way <0,0,0,0,0>s.
  • Can we generalize these networks to support other models of computation, like streams or SMT-style threads?
58
More Open Questions
  • How do we design low-energy scalar operand networks?
  • How do we support speculation on a distributed scalar operand network?
  • How do compilers need to change?

59
Summary
  • 5 Challenges
  • Delay Scalability
  • Bandwidth Scalability
  • Deadlock / Starvation
  • Exceptions
  • Efficient Operation-Operand Matching
  • The 5-tuple model
  • Quantitative Results
  • Mentioned open questions