Title: Tomorrow's Computing Engines, February 3, 1998, Symposium on High-Performance Computer Architecture
1. Tomorrow's Computing Engines
February 3, 1998
Symposium on High-Performance Computer Architecture
William J. Dally
Computer Systems Laboratory, Stanford University
billd_at_csl.stanford.edu
2. Focus on Tomorrow, not Yesterday
- Generals tend to always fight the last war
- Computer architects tend to always design the last computer
- old programs
- old technology assumptions
3. Some Previous Wars (1/3)
Reliable Router 1994
Torus Routing Chip 1985
MARS Router 1984
Network Design Frame 1988
4. Some Previous Wars (2/3)
MDP Chip
J-Machine
Cray T3D
MAP Chip
5. Some Previous Wars (3/3)
6. Tomorrow's Computing Engines
- Driven by tomorrow's applications - media
- Constrained by tomorrow's technology
7. 90% of Desktop Cycles will Be Spent on Media Applications by 2000
- Quote from Scott Kirkpatrick of IBM (talk abstract)
- Media applications include
- video encode/decode
- polygon and image-based graphics
- audio processing - compression, music, speech - recognition/synthesis
- modulation/demodulation at audio and video rates
- These applications involve stream processing
- So do
- radar processing: SAR, STAP, MTI ...
8. Typical Media Kernel: Image Warp and Composite
- Read 10,000 pixels from memory
- Perform 100 16-bit integer operations on each pixel
- Test each pixel
- Write 3,000 result pixels that pass to memory
- Little reuse of data fetched from memory
- each pixel used once
- Little interaction between pixels
- very insensitive to operation latency
- Challenge is to maximize bandwidth
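The figures on this slide imply a high ratio of arithmetic to memory traffic. A minimal sketch of that arithmetic follows; the pixel and operation counts come from the slide, while counting one memory reference per pixel read or written is an assumption made here for illustration.

```python
# Counts taken from the slide.
PIXELS_READ = 10_000
OPS_PER_PIXEL = 100
PIXELS_WRITTEN = 3_000


def kernel_profile(pixels_read, ops_per_pixel, pixels_written):
    """Summarize arithmetic versus memory traffic for a streaming kernel.

    Assumption: each pixel read or written counts as one memory reference.
    """
    total_ops = pixels_read * ops_per_pixel
    memory_refs = pixels_read + pixels_written
    return {
        "total_ops": total_ops,
        "memory_refs": memory_refs,
        "ops_per_memory_ref": total_ops / memory_refs,
        "pass_fraction": pixels_written / pixels_read,
    }


profile = kernel_profile(PIXELS_READ, OPS_PER_PIXEL, PIXELS_WRITTEN)
for name, value in profile.items():
    print(f"{name}: {value}")
```

Under these assumptions the kernel performs roughly 77 operations per memory reference, which is why bandwidth to the ALUs, not latency, is the limiting factor.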
9. Telepresence: A Driving Application
Acquire 2D Images
Extract Depth (3D Images)
Segmentation Model Extraction
Compression
Channel
Decompression
Rendering
Display 3D Scene
Most kernels:
- Latency insensitive
- High ratio of arithmetic to memory references
10. Tomorrow's Technology is Wire Limited
- Lots of devices
- A little faster
- Slow wires
11. Technology scaling makes communication the scarce resource
1997: 0.35µm, 64Mb DRAM, 16 64b FP Proc, 400MHz; chip 18mm, 12,000 tracks, 1 clock
2007: 0.10µm, 4Gb DRAM, 1K 64b FP Proc, 2.5GHz; chip 32mm, 90,000 tracks, 20 clocks
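The two technology points on this slide can be compared directly. The sketch below uses only the slide's numbers; reading "1K" as 1024 and dividing tracks by processors as a rough measure of wiring per processor are assumptions made here, not figures from the talk.

```python
# Values from the slide; "1K" processors is read as 1024 (an assumption).
chip_1997 = {"processors": 16, "clock_hz": 400e6, "tracks": 12_000, "cross_chip_clocks": 1}
chip_2007 = {"processors": 1024, "clock_hz": 2.5e9, "tracks": 90_000, "cross_chip_clocks": 20}


def growth(key):
    """Factor by which a quantity grows from the 1997 point to the 2007 point."""
    return chip_2007[key] / chip_1997[key]


def tracks_per_processor(chip):
    """Rough, hypothetical measure of wiring available per processor."""
    return chip["tracks"] / chip["processors"]


for key in chip_1997:
    print(f"{key}: {growth(key):.2f}x")
print("tracks per processor, 1997:", tracks_per_processor(chip_1997))
print("tracks per processor, 2007:", tracks_per_processor(chip_2007))
```

On these numbers, processor count grows 64x while wiring tracks grow only 7.5x, and a cross-chip signal goes from 1 clock to 20.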
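12. On-chip wires are getting slower
Scaling from generation 1 to generation 2 with $s = 0.5$ ($R$ and $C$ per unit length, wire length $y$ held fixed):
- $x_2 = s\,x_1$ (0.5x)
- $R_2 = R_1 / s^2$ (4x)
- $C_2 = C_1$ (1x)
- $t_{w2} = R_2 C_2 y^2 = t_{w1} / s^2$ (4x)
- $t_{w2} / t_{g2} = t_{w1} / (t_{g1} s^3)$ (8x)
- $v = 0.5\,(t_g R C)^{-1/2}$ (m/s), so $v_2 = v_1 s^{1/2}$ (0.7x)
- $v\,t_g = 0.5\,(t_g / RC)^{1/2}$ (m/gate), so $v_2 t_{g2} = v_1 t_{g1} s^{3/2}$ (0.35x)
[Figure: wires of length y with dimensions x1 and x2; an unbroken wire has delay $t_w = RCy^2$; a repeated wire alternates segments of delay $RCy^2$ with gates of delay $t_g$]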
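The scaling relations on this slide can be checked numerically. The sketch below normalizes every generation-1 quantity to 1 and applies a scale factor s; taking the gate delay to scale as s is an assumption inferred from the slide's ratio of wire delay to gate delay, and the function and key names are invented for illustration.

```python
def wire_scaling(s):
    """Return generation-2 quantities relative to generation 1 for scale factor s.

    R and C are per unit length; the wire length y is held fixed.
    Assumption: gate delay t_g scales linearly with s.
    """
    r = 1.0 / s**2          # R2 = R1 / s^2
    c = 1.0                 # C2 = C1
    t_wire = r * c          # t_w = R C y^2 with y fixed
    t_gate = s              # assumed gate-delay scaling
    return {
        "R": r,
        "C": c,
        "wire_delay": t_wire,
        "wire_delay_in_gate_delays": t_wire / t_gate,
        # v = 0.5 (t_g R C)^(-1/2); the constant 0.5 cancels in the ratio.
        "velocity": (t_gate * r * c) ** -0.5,
        # v t_g = 0.5 (t_g / (R C))^(1/2)
        "reach_per_gate_delay": (t_gate / (r * c)) ** 0.5,
    }


factors = wire_scaling(0.5)
for name, value in factors.items():
    print(f"{name}: {value:.3f}x")
```

With s = 0.5 this reproduces the factors quoted on the slide: 4x resistance and wire delay, 8x wire delay measured in gate delays, about 0.7x signal velocity, and about 0.35x distance reached per gate delay.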
13. Bandwidth and Latency of Modern VLSI
[Figure: bandwidth and latency plotted against size on logarithmic axes, with the chip boundary marked]
14. Architecture for Locality: Exploit high on-chip bandwidth
Pin-Bandwidth, 2GB/s
Off-chip RAM
Vector Reg File
104 32-bit ALUs
Switch
50GB/s
500GB/s
15. Tomorrow's Computing Engines
- Aimed at media processing
- stream based
- latency tolerant
- low-precision
- little reuse
- lots of conditionals
- Use the large number of devices available on future chips
- Make efficient use of scarce communication resources
- bandwidth hierarchy
- no centralized resources
- Approach the performance of a special-purpose processor
16. Why do Special-Purpose Processors Perform Well?
Fed by dedicated wires/memories
Lots (100s) of ALUs
17. Care and Feeding of ALUs
Instr. Cache
IP
Instruction Bandwidth
IR
Data Bandwidth
Regs
Feeding Structure Dwarfs ALU
18. Three Key Problems
- Instruction bandwidth
- Data bandwidth
- Conditional execution
19. A Bandwidth Hierarchy
[Figure labels: SDRAM (x4), Streaming Memory, Vector Register File, ALU Cluster (x3), 13 ALUs per cluster; bandwidths 1.6GB/s, 50GB/s, 500GB/s]
- Solves data bandwidth problem
- Matched to bandwidth curve of technology
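To make the hierarchy concrete, the sketch below computes the bandwidth ratio between adjacent levels. The three bandwidth figures come from the slide; attaching 1.6GB/s to the SDRAM interface, 50GB/s to the vector register file, and 500GB/s to the ALU clusters is an assumption about how the figure's labels pair up.

```python
# Assumed pairing of the slide's bandwidth labels with hierarchy levels,
# ordered from farthest from the ALUs to closest.
LEVELS_GB_PER_S = [
    ("SDRAM", 1.6),
    ("vector register file", 50.0),
    ("ALU cluster", 500.0),
]


def bandwidth_steps(levels):
    """Return (outer level, inner level, bandwidth ratio) for adjacent levels."""
    steps = []
    for (outer_name, outer_bw), (inner_name, inner_bw) in zip(levels, levels[1:]):
        steps.append((outer_name, inner_name, inner_bw / outer_bw))
    return steps


steps = bandwidth_steps(LEVELS_GB_PER_S)
overall = LEVELS_GB_PER_S[-1][1] / LEVELS_GB_PER_S[0][1]

for outer, inner, ratio in steps:
    print(f"{outer} -> {inner}: {ratio:.2f}x")
print(f"overall: {overall:.1f}x")
```

Under this pairing, bandwidth rises by roughly 31x and then 10x moving toward the ALUs, so an application keeps the ALUs fed only if most operands come from the inner levels rather than from SDRAM.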
20. A Streaming Memory System
Reorder Queue
SDRAM Bank
IX
Address Generator
D
Crossbar
Address Generator
Reorder Queue
SDRAM Bank
21. Streaming Memory Performance
- Exploit latency insensitivity for improved bandwidth
- 1.75:1 performance improvement from relatively short reorder queue
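One way a reorder queue can trade latency for bandwidth is by servicing requests to an already open SDRAM row ahead of older requests to other rows. The sketch below is a toy model of that idea, not the memory system from the talk: the cycle costs, the oldest-hit-first policy, and the access pattern are all assumptions chosen for illustration.

```python
from collections import deque

ROW_HIT_CYCLES = 1   # assumed cost of accessing the open row
ROW_MISS_CYCLES = 6  # assumed cost of opening a different row


def service_cycles(rows, queue_depth):
    """Cycles needed to service a stream of row accesses on one bank.

    Up to queue_depth requests wait in the reorder queue. The oldest request
    that hits the open row goes first; otherwise the oldest request goes.
    A queue_depth of 1 is plain in-order service.
    """
    pending = deque()
    incoming = iter(rows)
    open_row = None
    cycles = 0
    while True:
        # Refill the queue from the incoming request stream.
        while len(pending) < queue_depth:
            nxt = next(incoming, None)
            if nxt is None:
                break
            pending.append(nxt)
        if not pending:
            return cycles
        # Index of the oldest row hit, or 0 (the oldest request) if none hits.
        choice = next((i for i, r in enumerate(pending) if r == open_row), 0)
        row = pending[choice]
        del pending[choice]
        cycles += ROW_HIT_CYCLES if row == open_row else ROW_MISS_CYCLES
        open_row = row


# Two interleaved streams that alternate between rows 0 and 1.
rows = [0, 1] * 8
in_order = service_cycles(rows, queue_depth=1)
reordered = service_cycles(rows, queue_depth=4)
print("in order:", in_order, "cycles; reordered:", reordered, "cycles")
```

In-order service misses on every access in this pattern, while even a four-entry queue groups accesses to the same row. A real design would also need to bound how long a request can be bypassed.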
22. Compound Vector Operations: 1 Instruction does lots of work
[Figure labels: Memory Instructions; Compound Vector Instruction with fields LD, Vd, Vx, Op; vector registers V0 to V7; Control Store, uIP; Mem, AG, VRF; ALU fields Op, Ra, Rb (x3)]
1 CV Inst (50b)
uInst (300b) x 20 uInst/Op x 1000 el/vec = 6 x 10^6 b
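The instruction-bandwidth arithmetic on this slide can be reproduced in a few lines. The bit widths and counts are the slide's; reading the result as the amount of control delivered per fetched compound vector instruction is an interpretation added here.

```python
# Figures from the slide.
UINST_BITS = 300            # width of one microinstruction
UINSTS_PER_OP = 20          # microinstructions per compound operation
ELEMENTS_PER_VECTOR = 1000  # vector length
CV_INST_BITS = 50           # width of one compound vector instruction

# Bits of microcode issued while one compound vector instruction executes.
microcode_bits = UINST_BITS * UINSTS_PER_OP * ELEMENTS_PER_VECTOR

# How much control each fetched instruction bit expands into.
amplification = microcode_bits / CV_INST_BITS

print(f"microcode bits per CV instruction: {microcode_bits:.1e}")
print(f"amplification over the {CV_INST_BITS}b instruction: {amplification:,.0f}x")
```

On these figures a single 50-bit instruction stands in for 6 x 10^6 bits of microinstructions, which is the sense in which one instruction does lots of work.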
23. Scheduling by Simulated Annealing
- List scheduling assumes global communication
- does poorly when communication exposed
- View scheduling as a CAD problem (place and route)
- generate naïve feasible schedule
- iteratively improve schedule by moving operations.
ALUs
Time
Ready Ops
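The approach on this slide can be sketched as follows: start from a naive feasible placement of operations on ALUs, then repeatedly move one operation and keep the move according to an annealing rule. This is a toy illustration, not the scheduler from the talk. The dependence graph, the communication delay, the cooling schedule, and the choice to move operations only between ALUs (not in issue order) are all assumptions.

```python
import math
import random

# Hypothetical dependence graph: operation -> operations it depends on.
DEPS = {"a": [], "b": [], "c": ["a"], "d": ["b"], "e": ["c", "d"], "f": ["e"]}
ORDER = ["a", "b", "c", "d", "e", "f"]  # a topological order of DEPS
NUM_ALUS = 2
COMM_DELAY = 2  # assumed extra cycles when a result crosses between ALUs


def makespan(placement):
    """Issue ops in ORDER on their assigned ALUs and return the finish time."""
    alu_free = [0] * NUM_ALUS
    done = {}
    for op in ORDER:
        alu = placement[op]
        ready = 0
        for pred in DEPS[op]:
            delay = 0 if placement[pred] == alu else COMM_DELAY
            ready = max(ready, done[pred] + delay)
        start = max(ready, alu_free[alu])
        done[op] = start + 1  # every op takes one cycle
        alu_free[alu] = done[op]
    return max(done.values())


def naive_placement():
    """Round-robin placement: feasible, but ignores communication cost."""
    return {op: i % NUM_ALUS for i, op in enumerate(ORDER)}


def anneal(steps=2000, start_temp=2.0, seed=0):
    """Improve the naive placement by moving one op at a time."""
    rng = random.Random(seed)
    placement = naive_placement()
    energy = makespan(placement)
    best, best_energy = dict(placement), energy
    for step in range(steps):
        temp = start_temp * (1 - step / steps) + 1e-9  # linear cooling
        op = rng.choice(ORDER)
        old_alu = placement[op]
        placement[op] = rng.randrange(NUM_ALUS)
        new_energy = makespan(placement)
        delta = new_energy - energy
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            energy = new_energy  # accept the move
            if energy < best_energy:
                best, best_energy = dict(placement), energy
        else:
            placement[op] = old_alu  # reject the move
    return best, best_energy


best, best_energy = anneal()
print("naive makespan:", makespan(naive_placement()))
print("annealed makespan:", best_energy, "with placement", best)
```

The energy function here is simply the schedule length with communication delay charged between ALUs, which is what lets the search penalize placements that a communication-blind list scheduler would accept.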
24. Typical Annealing Schedule
166
Energy function changed
13
25. Conventional Approaches to Data-Dependent Conditional Execution
[Figure: diagrams of code blocks A, B, C, J, K executed under a condition y(x0) with Y/N outcomes and repeated "if y" tests. Labels: "Data-Dependent Branch"; "Speculative Loss D x W 1000"; "Exponentially Decreasing Duty Factor"; "Whoops"]
26. Zero-Cost Conditionals
- Most approaches to conditional operations are costly
- Branching control flow: dead issue slots on mispredicted branches
- Predication (SIMD select, masked vectors): large fraction of execution opportunities go idle.
- Conditional Vectors
- append an element to an output stream depending on a case variable.
Output Stream 0
Result Stream
0
Output Stream 1
1
Case Stream 0,1
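The conditional-vector idea, appending each result to one of several output streams according to a case value, can be sketched in software as below. The function name and the list-based streams are illustrative assumptions; the talk describes a hardware mechanism, not this code.

```python
def conditional_split(result_stream, case_stream, num_outputs=2):
    """Append each result to the output stream selected by its case value."""
    if len(result_stream) != len(case_stream):
        raise ValueError("result and case streams must have the same length")
    outputs = [[] for _ in range(num_outputs)]
    for value, case in zip(result_stream, case_stream):
        outputs[case].append(value)
    return outputs


results = [10, 11, 12, 13, 14]
cases = [0, 1, 1, 0, 1]
out0, out1 = conditional_split(results, cases)
print(out0, out1)  # out0 == [10, 13], out1 == [11, 12, 14]
```

Each output stream is dense, so a later kernel that consumes it does useful work on every element instead of idling on masked-off entries.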
27. Application Sketch - Polygon Rendering
[Figure: a triangle with vertices V1, V2, V3 is processed through a sequence of stream record types]
Vertex: X, Y, RGB, UV
Span: Y, X1, X2, RGB1, DRGB, UV1, DUV
Pixel: X, Y, RGB, UV
Textured Pixel: X, Y, RGB
28. Status
- Working simulator of Imagine
- Simple kernels running on simulator
- FFT
- Applications being developed
- Depth extraction, video compression, polygon rendering, image-based graphics
- Circuit/Layout studies underway
29. Acknowledgements
- Students/Staff
- Don Alpert (Intel)
- Chris Buehler (MIT)
- J.P. Grossman (MIT)
- Brad Johanson
- Ujval Kapasi
- Brucek Khailany
- Abelardo Lopez-Lagunas
- Peter Mattson
- John Owens
- Scott Rixner
- Helpful Suggestions
- Henry Fuchs (UNC)
- Pat Hanrahan
- Tom Knight (MIT)
- Marc Levoy
- Leonard McMillan (MIT)
- John Poulton (UNC)
30. Conclusion
- Work toward tomorrow's computing engines
- Targeted toward media processing
- streams of low-precision samples
- little reuse
- latency tolerant
- Matched to the capabilities of communication-limited technology
- explicit bandwidth hierarchy
- explicit communication between units
- communication exposed
- Insight not numbers