1
Tomorrow's Computing Engines
February 3, 1998
Symposium on High-Performance Computer Architecture
William J. Dally, Computer Systems Laboratory, Stanford University
billd@csl.stanford.edu
2
Focus on Tomorrow, not Yesterday
Generals tend to always fight the last war.
Computer architects tend to always design the last computer:
  • old programs
  • old technology assumptions
3
Some Previous Wars (1/3)
Reliable Router 1994
Torus Routing Chip 1985
MARS Router 1984
Network Design Frame 1988
4
Some Previous Wars (2/3)
MDP Chip
J-Machine
Cray T3D
MAP Chip
5
Some Previous Wars (3/3)
6
Tomorrow's Computing Engines
  • Driven by tomorrow's applications: media
  • Constrained by tomorrow's technology

7
90% of Desktop Cycles Will Be Spent on Media Applications by 2000
  • Quote from Scott Kirkpatrick of IBM (talk abstract)
  • Media applications include:
  • video encode/decode
  • polygon and image-based graphics
  • audio processing: compression, music, speech recognition/synthesis
  • modulation/demodulation at audio and video rates
  • These applications involve stream processing
  • So do:
  • radar processing: SAR, STAP, MTI, ...

8
Typical Media Kernel: Image Warp and Composite
  • Read 10,000 pixels from memory
  • Perform 100 16-bit integer operations on each
    pixel
  • Test each pixel
  • Write 3,000 result pixels that pass to memory
  • Little reuse of data fetched from memory
  • each pixel used once
  • Little interaction between pixels
  • very insensitive to operation latency
  • Challenge is to maximize bandwidth (a sketch of such a kernel follows)
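A minimal sketch of a kernel with this shape, in Python/NumPy; the per-pixel arithmetic and the pass test below are placeholders, not the actual warp-and-composite math:

```python
import numpy as np

def warp_and_composite(pixels, gain=3, bias=7, threshold=1 << 12):
    """pixels: 1-D array of 16-bit samples, each fetched from memory once."""
    px = pixels.astype(np.int32)           # widen so intermediate math doesn't wrap
    for _ in range(100):                   # stands in for ~100 integer ops per pixel
        px = (px * gain + bias) & 0xFFFF   # each pixel is processed independently
    passed = px < threshold                # test each pixel
    return px[passed].astype(np.uint16)    # only passing pixels are written back

out = warp_and_composite(np.arange(10_000, dtype=np.uint16))
print(len(out), "of 10,000 pixels written back")
```

Every pixel is touched once and processed independently of its neighbors, which is why the kernel is insensitive to latency and limited by bandwidth.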

9
Telepresence: A Driving Application
Acquire 2D Images → Extract Depth (3D Images) → Segmentation / Model Extraction → Compression → Channel → Decompression → Rendering → Display 3D Scene
Most kernels are latency insensitive, with a high ratio of arithmetic to memory references.
10
Tomorrow's Technology is Wire-Limited
  • Lots of devices
  • A little faster
  • Slow wires

11
Technology scaling makes communication the scarce
resource
1997: 0.35 µm process, 64 Mb DRAM, 16 64-bit FP processors, 400 MHz
      18 mm die edge, 12,000 wire tracks, crossed in 1 clock
2007: 0.10 µm process, 4 Gb DRAM, 1K 64-bit FP processors, 2.5 GHz
      32 mm die edge, 90,000 wire tracks, crossed in 20 clocks
12
On-chip wires are getting slower
Scale feature size by s (here s = 0.5) while keeping wire length y fixed; gate delay tg scales with s and wire delay is tw = R C y^2:
  x2 = s x1 (0.5x)
  R2 = R1 / s^2 (4x)                wire resistance per unit length
  C2 = C1 (1x)                      wire capacitance per unit length
  tw2 = R2 C2 y^2 = tw1 / s^2 (4x)  delay of a fixed-length wire
  tw2 / tg2 = tw1 / (tg1 s^3) (8x)  wire delay relative to gate delay
  v = 0.5 (tg R C)^(-1/2) (m/s),    v2 = v1 s^(1/2) (0.7x)
  v tg = 0.5 (tg / (R C))^(1/2) (m/gate),  v2 tg2 = v1 tg1 s^(3/2) (0.35x)
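A small sketch that plugs s = 0.5 into the relations above to reproduce the factors quoted on the slide:

```python
# Evaluate the wire-scaling relations for one process shrink (s = 0.5).
# R, C, and tg for the old process are normalized to 1.
s = 0.5                          # linear feature-size scaling factor
R_ratio = 1 / s**2               # wire resistance per unit length: 4x
C_ratio = 1.0                    # wire capacitance per unit length: ~unchanged
tw_ratio = R_ratio * C_ratio     # delay of a fixed-length wire: 4x
tg_ratio = s                     # gate delay: 0.5x
rel_ratio = tw_ratio / tg_ratio  # wire delay relative to gate delay: 8x
v_ratio = s**0.5                 # signal velocity (m/s): ~0.7x
vtg_ratio = s**1.5               # distance covered per gate delay: ~0.35x
print(R_ratio, tw_ratio, rel_ratio, round(v_ratio, 2), round(vtg_ratio, 2))
```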
13
Bandwidth and Latency of Modern VLSI
[Log-log plot: bandwidth and latency versus size (distance from the processing element), spanning roughly 1 to 10^5; the chip boundary is marked, with bandwidth falling and latency rising steeply beyond it.]
14
Architecture for Locality: Exploit High On-Chip Bandwidth
[Diagram: off-chip RAM connects over 2 GB/s of pin bandwidth to a vector register file (50 GB/s), which feeds 10^4 32-bit ALUs through a 500 GB/s switch.]
15
Tomorrow's Computing Engines
  • Aimed at media processing
  • stream based
  • latency tolerant
  • low-precision
  • little reuse
  • lots of conditionals
  • Use the large number of devices available on
    future chips
  • Make efficient use of scarce communication
    resources
  • bandwidth hierarchy
  • no centralized resources
  • Approach the performance of a special-purpose
    processor

16
Why do Special-Purpose Processors Perform Well?
Fed by dedicated wires/memories
Lots (100s) of ALUs
17
Care and Feeding of ALUs
[Diagram: a single ALU fed by an instruction path (instruction cache, IP, IR: instruction bandwidth) and a data path (register file: data bandwidth). The feeding structure dwarfs the ALU.]
18
Three Key Problems
  • Instruction bandwidth
  • Data bandwidth
  • Conditional execution

19
A Bandwidth Hierarchy
[Diagram: SDRAM banks feed a streaming memory system at 1.6 GB/s, which feeds a vector register file at 50 GB/s, which feeds ALU clusters (13 ALUs per cluster) at 500 GB/s.]
  • Solves data bandwidth problem
  • Matched to bandwidth curve of technology
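A hedged reading of the figures above: each level of the hierarchy supplies roughly 10-30x the bandwidth of the level below it, so a kernel must reuse data captured at each level by about that factor to keep the ALUs busy. The GB/s numbers are from the slide; the reuse interpretation is an assumption.

```python
# Bandwidth ratios between adjacent levels of the hierarchy on this slide (GB/s).
levels = [("off-chip SDRAM", 1.6), ("vector register file", 50.0), ("ALU clusters", 500.0)]
for (low_name, low_bw), (high_name, high_bw) in zip(levels, levels[1:]):
    print(f"{high_name} vs. {low_name}: {high_bw / low_bw:.0f}x")
```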

20
A Streaming Memory System
[Diagram: address generators issue index (IX) and data (D) references through a crossbar to per-bank reorder queues in front of the SDRAM banks.]
21
Streaming Memory Performance
  • Exploit latency insensitivity for improved bandwidth
  • 1.75:1 performance improvement from a relatively short reorder queue (a sketch follows)
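A toy sketch of why even a short reorder queue helps: serving pending references that hit the currently open DRAM row first avoids row precharge/activate penalties. The timing constants and access pattern are illustrative, not taken from the talk.

```python
from collections import deque

# Toy timing: a reference to the open row costs 1 cycle; changing rows costs 4.
ROW_HIT, ROW_MISS = 1, 4

def service(refs, window):
    """refs: (row, column) addresses in program order; window: reorder-queue depth."""
    pending, cycles, open_row = deque(refs), 0, None
    while pending:
        lookahead = list(pending)[:window]
        # prefer any pending reference that hits the currently open row
        pick = next((r for r in lookahead if r[0] == open_row), lookahead[0])
        pending.remove(pick)
        cycles += ROW_HIT if pick[0] == open_row else ROW_MISS
        open_row = pick[0]
    return cycles

refs = [(i % 4, i) for i in range(64)]           # accesses interleaved across 4 rows
print(service(refs, window=1), "cycles issued strictly in order")
print(service(refs, window=16), "cycles with a 16-deep reorder window")
```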

22
Compound Vector Operations: One Instruction Does Lots of Work
[Diagram contrasting a compound vector instruction with the memory/microinstruction stream it replaces: one 50-bit CV instruction (fields LD, Vd, Vx, Op over vector registers V0-V7) versus uInst (300 b) x 20 uInst/op x 1000 el/vec = 6 x 10^6 b. Microinstructions are sequenced from a control store (uIP) and carry separate Op/Ra/Rb fields for the memory unit, address generator (AG), and vector register file (VRF).]
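The control-bandwidth arithmetic from this slide, written out; the final ratio against the 50-bit compound vector instruction is derived from the slide's figures rather than stated on it:

```latex
% Microcode needed to run one compound vector operation conventionally:
300~\mathrm{b/\mu inst} \times 20~\mathrm{\mu inst/op} \times 1000~\mathrm{el/vec}
  = 6 \times 10^{6}~\mathrm{b}
% versus a single 50 b compound vector instruction, an instruction-bandwidth
% reduction of roughly
\frac{6 \times 10^{6}~\mathrm{b}}{50~\mathrm{b}} = 1.2 \times 10^{5}
```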
23
Scheduling by Simulated Annealing
  • List scheduling assumes global communication
  • does poorly when communication is exposed
  • View scheduling as a CAD problem (place and route)
  • generate a naïve feasible schedule
  • iteratively improve the schedule by moving operations (see the sketch below)

[Diagram: operations from a ready list are placed onto a grid of ALUs (one axis) versus time (the other).]
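A minimal sketch of the annealing-based scheduler described above; the move set, cost function, and toy dependence model are placeholders, not the actual Imagine scheduler:

```python
import math
import random

def anneal(ops, n_alus, n_slots, cost, steps=20_000, t0=5.0, alpha=0.9995):
    """Iteratively improve an op -> (time slot, ALU) placement, annealing style."""
    # naive feasible schedule: ops placed in program order, one per (slot, ALU)
    sched = {op: (i // n_alus, i % n_alus) for i, op in enumerate(ops)}
    best, best_cost, temp = dict(sched), cost(sched), t0
    for _ in range(steps):
        op = random.choice(ops)                                  # move one operation
        old = sched[op]
        sched[op] = (random.randrange(n_slots), random.randrange(n_alus))
        delta = cost(sched) - cost({**sched, op: old})           # change from the move
        if delta > 0 and random.random() > math.exp(-delta / temp):
            sched[op] = old                                      # reject most uphill moves
        if cost(sched) < best_cost:
            best, best_cost = dict(sched), cost(sched)
        temp *= alpha                                            # cool down
    return best, best_cost

# Toy problem: 12 ops on 3 ALUs within 8 time slots. Cost = schedule length plus
# penalties for two ops sharing an (ALU, slot) and for a dependent op (odd index)
# scheduled no later than its producer (the even op before it).
ops = list(range(12))
def cost(s):
    length = max(slot for slot, _ in s.values()) + 1
    conflicts = len(s) - len(set(s.values()))
    deps = sum(s[i][0] >= s[i + 1][0] for i in range(0, len(ops), 2))
    return length + 10 * (conflicts + deps)

schedule, final_cost = anneal(ops, n_alus=3, n_slots=8, cost=cost)
print("final cost:", final_cost)
```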
24
Typical Annealing Schedule
[Plot of a typical annealing run: the schedule cost falls from 166 to 13; an annotation marks where the energy function was changed.]
25
Conventional Approaches to Data-Dependent
Conditional Execution
[Diagram: conventional handling of a data-dependent conditional, choosing between paths B..C and J..K based on tests of y(x). Speculating past the data-dependent branch ("Whoops") throws away the work in flight on a misprediction, a speculative loss of D x W, roughly 1000 operations; executing both paths under predication leaves an exponentially decreasing duty factor as conditionals nest.]
26
Zero-Cost Conditionals
  • Most approaches to conditional operations are costly
  • Branching control flow: dead issue slots on mispredicted branches
  • Predication (SIMD select, masked vectors): a large fraction of execution opportunities go idle
  • Conditional vectors:
  • append an element to an output stream depending on a case variable (see the sketch below)

[Diagram: each element of the result stream is appended to output stream 0 or output stream 1 according to its value (0 or 1) in the case stream.]
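A minimal sketch of the conditional-stream idea, assuming a simple two-way case variable; function and variable names are illustrative:

```python
def conditional_split(result_stream, case_stream):
    """Append each result element to output stream 0 or 1 per its case value."""
    outputs = ([], [])                      # output stream 0, output stream 1
    for value, case in zip(result_stream, case_stream):
        outputs[case].append(value)         # append only: no dead issue slots,
    return outputs                          # no idled predicated lanes

# e.g. route pixels by the outcome of a per-pixel test computed earlier
results = [12, 7, 30, 5, 18]
cases   = [1, 0, 1, 0, 1]
stream0, stream1 = conditional_split(results, cases)
print(stream0, stream1)                     # [7, 5] [12, 30, 18]
```

Because elements are appended rather than branched over or masked, every execution slot does useful work.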
27
Application Sketch - Polygon Rendering
[Diagram: streams of records flowing through the rendering pipeline. Vertex records (X, Y, RGB, UV for V1, V2, V3) are expanded into Span records (Y, X1, X2, RGB1, dRGB, UV1, dUV), spans into Pixel records (X, Y, RGB, UV), and pixels into Textured Pixel records (X, Y, RGB).]
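A sketch of the record types this pipeline implies, with field names taken from the slide; the meanings of the delta fields and the kernels that expand one stream into the next are inferred, not given:

```python
from dataclasses import dataclass

@dataclass
class Vertex:            # three of these (V1, V2, V3) describe a triangle
    x: float
    y: float
    rgb: tuple
    uv: tuple

@dataclass
class Span:              # one scanline of a triangle
    y: int
    x1: float
    x2: float
    rgb1: tuple          # color at x1 ...
    drgb: tuple          # ... and its assumed per-pixel delta
    uv1: tuple           # texture coordinate at x1 ...
    duv: tuple           # ... and its assumed per-pixel delta

@dataclass
class Pixel:             # produced by walking a span from x1 to x2
    x: int
    y: int
    rgb: tuple
    uv: tuple

@dataclass
class TexturedPixel:     # after the texture lookup replaces UV with final color
    x: int
    y: int
    rgb: tuple
```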
28
Status
  • Working simulator of Imagine
  • Simple kernels running on simulator
  • FFT
  • Applications being developed
  • Depth extraction, video compression, polygon
    rendering, image-based graphics
  • Circuit/Layout studies underway

29
Acknowledgements
  • Students/Staff
  • Don Alpert (Intel)
  • Chris Buehler (MIT)
  • J.P. Grossman (MIT)
  • Brad Johanson
  • Ujval Kapasi
  • Brucek Khailany
  • Abelardo Lopez-Lagunas
  • Peter Mattson
  • John Owens
  • Scott Rixner
  • Helpful Suggestions
  • Henry Fuchs (UNC)
  • Pat Hanrahan
  • Tom Knight (MIT)
  • Marc Levoy
  • Leonard McMillan (MIT)
  • John Poulton (UNC)

30
Conclusion
  • Work toward tomorrow's computing engines
  • Targeted toward media processing
  • streams of low-precision samples
  • little reuse
  • latency tolerant
  • Matched to the capabilities of communication-limited technology
  • explicit bandwidth hierarchy
  • explicit communication between units
  • communication exposed
  • Insight, not numbers