Tomorrow's Computing Engines, February 3, 1998, Symposium on High-Performance Computer Architecture

Transcript and Presenter's Notes

1
Tomorrow's Computing Engines
February 3, 1998
Symposium on High-Performance Computer Architecture
William J. Dally
Computer Systems Laboratory, Stanford University
billd_at_csl.stanford.edu
2
Focus on Tomorrow, not Yesterday
Generals tend to always fight the last war
Computer architects tend to always design the last computer
old programs
old technology assumptions
3
Some Previous Wars (1/3)
Reliable Router 1994
Torus Routing Chip 1985
MARS Router 1984
Network Design Frame 1988
4
Some Previous Wars (2/3)
MDP Chip
J-Machine
Cray T3D
MAP Chip
5
Some Previous Wars (3/3)
6
Tomorrow's Computing Engines

Driven by tomorrow's applications - media
Constrained by tomorrow's technology

7
90% of Desktop Cycles Will Be Spent on Media Applications by 2000

Quote from Scott Kirkpatrick of IBM (talk abstract)
Media applications include
video encode/decode
polygon and image-based graphics
audio processing: compression, music, speech recognition/synthesis
modulation/demodulation at audio and video rates
These applications involve stream processing
So do
radar processing: SAR, STAP, MTI ...

8
Typical Media Kernel: Image Warp and Composite

Read 10,000 pixels from memory
Perform 100 16-bit integer operations on each
pixel
Test each pixel
Write 3,000 result pixels that pass to memory
Little reuse of data fetched from memory
each pixel used once
Little interaction between pixels
very insensitive to operation latency
Challenge is to maximize bandwidth
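The figures on this slide imply a high ratio of arithmetic to memory traffic. A minimal sketch of that arithmetic follows, assuming one memory reference per pixel read or written (the slide does not say how pixels map to memory words, and the function name is illustrative only).

```python
# Figures quoted on the slide.
PIXELS_READ = 10_000
OPS_PER_PIXEL = 100
PIXELS_WRITTEN = 3_000


def arithmetic_intensity(pixels_read, ops_per_pixel, pixels_written):
    """Return (total ops, memory references, ops per memory reference).

    Assumption: each pixel read or written costs exactly one memory reference.
    """
    total_ops = pixels_read * ops_per_pixel
    memory_refs = pixels_read + pixels_written
    return total_ops, memory_refs, total_ops / memory_refs


total_ops, memory_refs, ops_per_ref = arithmetic_intensity(
    PIXELS_READ, OPS_PER_PIXEL, PIXELS_WRITTEN
)
print(f"{total_ops} ops over {memory_refs} references: {ops_per_ref:.1f} ops/reference")
```

Under that assumption the kernel performs roughly 77 operations per memory reference, which is why the slide frames the challenge as bandwidth rather than latency.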

9
Telepresence: A Driving Application
Acquire 2D Images
Extract Depth (3D Images)
Segmentation Model Extraction
Compression
Channel
Decompression
Rendering
Display 3D Scene
Most kernels: latency insensitive, high ratio of arithmetic to memory references
10
Tomorrow's Technology is Wire Limited

Lots of devices
A little faster
Slow wires

11
Technology scaling makes communication the scarce
resource
1997: 0.35μm, 64Mb DRAM, 16 64b FP Proc, 400MHz; 18mm chip, 12,000 tracks, 1 clock
2007: 0.10μm, 4Gb DRAM, 1K 64b FP Proc, 2.5GHz; 32mm chip, 90,000 tracks, 20 clocks
12
On-chip wires are getting slower
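Scaling feature size by $s = 0.5$, with wire length $y$ held fixed:
$x_2 = s\,x_1$ (0.5x)
$R_2 = R_1 / s^2$ (4x)
$C_2 = C_1$ (1x)
$t_{w2} = R_2 C_2 y^2 = t_{w1} / s^2$ (4x)
$t_{w2} / t_{g2} = t_{w1} / (t_{g1} s^3)$ (8x)
$v = 0.5\,(t_g R C)^{-1/2}$ (m/s), $v_2 = v_1 s^{1/2}$ (0.7x)
$v\,t_g = 0.5\,(t_g / RC)^{1/2}$ (m/gate), $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35x)
(Figure: wires of length y at feature sizes x1 and x2, with wire delay $t_w = R C y^2$ and gate delay $t_g$)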
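The scaling factors on this slide can be checked numerically. The sketch below assumes gate delay scales linearly with s, which is what the 8x figure for wire delay measured in gate delays implies; the function and key names are illustrative only.

```python
def wire_scaling(s):
    """Ratios (new / old) when feature size is scaled by s, wire length y fixed.

    Assumption: gate delay tg scales linearly with s, as implied by the
    slide's tw/tg factor of 1/s^3.
    """
    r = 1 / s**2          # resistance per unit length: R2 = R1 / s^2
    c = 1.0               # capacitance per unit length: C2 = C1
    tw = r * c            # wire delay tw = R * C * y^2, with y unchanged
    tg = s                # gate delay
    return {
        "wire_delay": tw,                      # 1/s^2
        "wire_delay_in_gates": tw / tg,        # 1/s^3
        "velocity": (tg * r * c) ** -0.5,      # s^(1/2)
        "reach_per_gate": (tg / (r * c)) ** 0.5,  # s^(3/2)
    }


for name, ratio in wire_scaling(0.5).items():
    print(f"{name}: {ratio:.2f}x")
```

For s = 0.5 this gives 4x, 8x, about 0.7x, and about 0.35x, matching the factors listed on the slide.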
13
Bandwidth and Latency of Modern VLSI
(Figure: log-scale plot of bandwidth and latency versus size, with the chip boundary marked)
14
Architecture for Locality: Exploit high on-chip bandwidth
Pin-Bandwidth, 2GB/s
Off-chip RAM
Vector Reg File
104 32-bit ALUs
Switch
50GB/s
500GB/s
15
Tomorrow's Computing Engines

Aimed at media processing
stream based
latency tolerant
low-precision
little reuse
lots of conditionals

Use the large number of devices available on
future chips
Make efficient use of scarce communication
resources
bandwidth hierarchy
no centralized resources
Approach the performance of a special-purpose
processor

16
Why do Special-Purpose Processors Perform Well?
Fed by dedicated wires/memories
Lots (100s) of ALUs
17
Care and Feeding of ALUs
Instr. Cache
IP
Instruction Bandwidth
IR
Data Bandwidth
Regs
Feeding Structure Dwarfs ALU
18
Three Key Problems

Instruction bandwidth
Data bandwidth
Conditional execution

19
A Bandwidth Hierarchy
13 ALUs per cluster
SDRAM
ALU Cluster
ALU Cluster
SDRAM
Streaming Memory
Vector Register File
SDRAM
500GB/s
SDRAM
ALU Cluster
1.6GB/s
50GB/s

Solves data bandwidth problem
Matched to bandwidth curve of technology
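A rough sketch of what the three bandwidth figures on this slide imply. The assignment of each figure to a level (1.6GB/s for SDRAM, 50GB/s for the vector register file, 500GB/s within the ALU clusters) is an assumption read off the flattened diagram, not something the transcript states explicitly.

```python
# Assumed mapping of the slide's figures to hierarchy levels (GB/s).
LEVELS = [
    ("SDRAM", 1.6),
    ("Vector Register File", 50.0),
    ("ALU clusters", 500.0),
]


def bandwidth_ratios(levels):
    """Return {(outer, inner): inner bandwidth / outer bandwidth} for adjacent levels."""
    ratios = {}
    for (outer, outer_bw), (inner, inner_bw) in zip(levels, levels[1:]):
        ratios[(outer, inner)] = inner_bw / outer_bw
    return ratios


for (outer, inner), ratio in bandwidth_ratios(LEVELS).items():
    print(f"{inner} supplies about {ratio:.1f}x the bandwidth of {outer}")
```

Under this reading, each word fetched from SDRAM has to be referenced roughly 300 times at the ALUs to keep them fed, which only works for kernels with a high ratio of arithmetic to memory references.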

20
A Streaming Memory System
Reorder Queue
SDRAM Bank
IX
Address Generator
D
Crossbar
Address Generator
Reorder Queue
SDRAM Bank
21
Streaming Memory Performance

Exploit latency insensitivity for improved
bandwidth
1.75:1 performance improvement from relatively
short reorder queue
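The idea of trading latency for bandwidth with a reorder queue can be illustrated with a toy model. In the sketch below the row size, hit and miss costs, and selection policy are all hypothetical, chosen only to show the mechanism; it does not model the system on the slides or reproduce its measured improvement.

```python
from collections import deque


def service_cycles(addresses, queue_depth, row_size=256, hit_cost=1, miss_cost=6):
    """Cycles for one SDRAM bank to serve a request stream (toy model).

    Assumptions: a request to the currently open row costs hit_cost cycles,
    any other request costs miss_cost cycles and opens its row. The reorder
    queue holds up to queue_depth requests and serves the oldest one that
    hits the open row, falling back to the oldest request overall.
    """
    pending = deque(addresses)
    window = []
    open_row = None
    cycles = 0
    while pending or window:
        # Refill the reorder queue from the incoming stream.
        while pending and len(window) < queue_depth:
            window.append(pending.popleft())
        # Oldest row hit if there is one, else the oldest request.
        pick = next(
            (i for i, a in enumerate(window) if a // row_size == open_row), 0
        )
        row = window.pop(pick) // row_size
        cycles += hit_cost if row == open_row else miss_cost
        open_row = row
    return cycles


# Two streams interleaved across two rows: the worst case for in-order service.
stream = [0, 256, 1, 257, 2, 258, 3, 259]
print(service_cycles(stream, queue_depth=1), service_cycles(stream, queue_depth=4))
```

With a queue depth of 1 every access misses; a depth of 4 groups accesses by row at the price of serving some requests late, which a latency-insensitive stream can tolerate. A real controller would also bound how long a request may wait, which this sketch omits.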

22
Compound Vector Operations: 1 Instruction does
lots of work
Memory Instructions
Compound Vector Instruction
1 CV Inst (50b)
LD
Vd
Vx
Op
V0
V1
V2
V3
V4
V5
V6
V7
uInst (300b) x 20 uInst/Op x 1000 el/vec = 6 x 10^6 b
Control Store
uIP
Mem
AG
VRF
Op
Ra
Rb
Op
Ra
Rb
Op
Ra
Rb
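The instruction-bandwidth argument on this slide is a short calculation. The sketch below uses the slide's figures; reading the result as an amplification ratio between fetched instruction bits and the microinstruction bits they expand into is an interpretation, and the constant names are illustrative.

```python
# Figures quoted on the slide.
CV_INST_BITS = 50            # one compound vector instruction
UINST_BITS = 300             # one microinstruction
UINSTS_PER_OP = 20           # microinstructions per compound operation
ELEMENTS_PER_VECTOR = 1000   # elements processed per vector

# Microinstruction bits issued from the control store for one CV instruction.
expanded_bits = UINST_BITS * UINSTS_PER_OP * ELEMENTS_PER_VECTOR

# Bits of local control delivered per bit of instruction fetched.
amplification = expanded_bits / CV_INST_BITS

print(f"{expanded_bits:.0e} microinstruction bits per {CV_INST_BITS}-bit instruction")
print(f"amplification: {amplification:,.0f}x")
```

On these numbers a single 50-bit instruction stands in for 6 x 10^6 bits of control, so nearly all instruction bandwidth is supplied locally by the control store.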
23
Scheduling by Simulated Annealing

List scheduling assumes global communication
does poorly when communication exposed
View scheduling as a CAD problem (place and
route)
generate naïve feasible schedule
iteratively improve schedule by moving
operations.

ALUs
Time
Ready Ops
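The place-and-route view of scheduling can be sketched as follows: start from a naive feasible schedule and repeatedly try moving one operation to a new ALU and time slot. This is a minimal illustration, not the scheduler behind the slides. The energy function, cooling schedule, one-cycle operation latency, and communication delay are all assumptions.

```python
import math
import random


def feasible(sched, deps, comm_delay):
    """sched maps op -> (alu, time); deps maps op -> list of predecessor ops."""
    if len(set(sched.values())) != len(sched):
        return False  # two operations share an (alu, time) slot
    for op, preds in deps.items():
        alu, t = sched[op]
        for p in preds:
            p_alu, p_t = sched[p]
            # Assumed: 1 cycle per op, plus comm_delay when crossing ALUs.
            latency = 1 + (comm_delay if p_alu != alu else 0)
            if t < p_t + latency:
                return False
    return True


def makespan(sched):
    return 1 + max(t for _, t in sched.values())


def energy(sched):
    # Schedule length, plus a small term that rewards moving any op earlier.
    return makespan(sched) + 0.01 * sum(t for _, t in sched.values())


def anneal(deps, n_alus, comm_delay=1, steps=5000, t_start=2.0, t_end=0.01, seed=0):
    rng = random.Random(seed)
    ops = list(deps)  # assumed to be listed in topological order
    # Naive feasible schedule: everything on ALU 0, one op per cycle.
    sched = {op: (0, i) for i, op in enumerate(ops)}
    best = dict(sched)
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        trial = dict(sched)
        trial[rng.choice(ops)] = (rng.randrange(n_alus), rng.randrange(len(ops)))
        if not feasible(trial, deps, comm_delay):
            continue
        delta = energy(trial) - energy(sched)
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            sched = trial
            if energy(sched) < energy(best):
                best = dict(sched)
    return best


deps = {"a": [], "b": [], "c": ["a", "b"], "d": [], "e": [], "f": ["d", "e"], "g": ["c", "f"]}
best = anneal(deps, n_alus=2)
print(makespan(best), best)
```

Because the cross-ALU delay is part of the feasibility check, communication cost is visible to every move, which is the property the slide says list scheduling lacks.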
24
Typical Annealing Schedule
(Figure: plot of an annealing run, with labeled values 166 and 13 and the annotation "Energy function changed")
25
Conventional Approaches to Data-Dependent
Conditional Execution
(Figure: diagrams of conventional handling of a data-dependent branch on y(x0) among code blocks A, B, C, J, K. Labels: Data-Dependent Branch; Speculative Loss D x W 1000; Exponentially Decreasing Duty Factor; Whoops)
26
Zero-Cost Conditionals

Most Approaches to Conditional Operations are
Costly
Branching control flow - dead issue slots on
mispredicted branches
Predication (SIMD select, masked vectors) - large
fraction of execution opportunities go idle.
Conditional Vectors
append an element to an output stream depending
on a case variable.

Output Stream 0
Result Stream
0
Output Stream 1
1
Case Stream 0,1
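A minimal sketch of the conditional-vector idea: each element of the result stream is appended to one of several output streams according to the matching entry in the case stream. The function name and the list-based representation are illustrative assumptions; the slides describe a hardware mechanism, not this code.

```python
def conditional_split(result_stream, case_stream, n_cases=2):
    """Append each result to the output stream selected by its case value."""
    outputs = [[] for _ in range(n_cases)]
    for value, case in zip(result_stream, case_stream):
        outputs[case].append(value)
    return outputs


stream0, stream1 = conditional_split([10, 20, 30, 40], [0, 1, 1, 0])
print(stream0, stream1)
```

Each output stream is dense, so later kernels run over only the elements that took their path, rather than idling on masked-off elements as predication does.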
27
Application Sketch - Polygon Rendering
(Figure: polygon rendering as a sequence of stream records)
Triangle: vertices V1, V2, V3
Vertex: X, Y, RGB, UV
Span: Y, X1, X2, RGB1, DRGB, UV1, DUV
Pixel: X, Y, RGB, UV
Textured Pixel: X, Y, RGB
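One stage of this pipeline, expanding a Span record into Pixel records, can be sketched from the field names alone. The sketch assumes DRGB and DUV are per-pixel increments along the span and that X2 is inclusive; neither is stated in the transcript, and the class and function names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class Span:
    y: int
    x1: int
    x2: int
    rgb1: tuple   # color at x1
    drgb: tuple   # assumed: color increment per pixel
    uv1: tuple    # texture coordinate at x1
    duv: tuple    # assumed: texture-coordinate increment per pixel


@dataclass
class Pixel:
    x: int
    y: int
    rgb: tuple
    uv: tuple


def span_to_pixels(span):
    """Emit one Pixel per x in [x1, x2], linearly interpolating RGB and UV."""
    pixels = []
    for x in range(span.x1, span.x2 + 1):
        k = x - span.x1
        rgb = tuple(c + k * d for c, d in zip(span.rgb1, span.drgb))
        uv = tuple(c + k * d for c, d in zip(span.uv1, span.duv))
        pixels.append(Pixel(x, span.y, rgb, uv))
    return pixels


for p in span_to_pixels(Span(5, 2, 4, (10, 20, 30), (1, 2, 3), (0.0, 0.0), (0.5, 0.25))):
    print(p)
```

Each input record yields a variable number of output records and no pixel depends on another, which matches the stream-processing profile described earlier in the talk.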
28
Status