Title: Stream Programming: Luring Programmers into the Multicore Era Bill Thies
1 Stream Programming: Luring Programmers into the Multicore Era
Bill Thies
- Computer Science and Artificial Intelligence
Laboratory - Massachusetts Institute of Technology
- Spring 2008
2 Multicores are Here
[Figure: number of cores per chip (1 to 512, log scale) versus year (1970 to 20??). Processors plotted: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Itanium, Itanium 2, Athlon, Power4, PExtreme, Power6, Yonah, PA-8800, Opteron, Tanglewood, Xeon MP, Xbox360, Opteron 4P, Broadcom 1480, Niagara, Cell, Raw, Raza XLR, Cavium Octeon, Tilera, Intel Tflops, Cisco CSR-1, Picochip PC102, Ambric AM2045]
3 Multicores are Here
[Figure: cores-versus-year chart from slide 2, annotated: "Hardware was responsible for improving performance"]
4 Multicores are Here
[Figure: cores-versus-year chart from slide 2, annotated: "Now, performance burden falls on programmers"]
5 Is Parallel Programming a New Problem?
- No! Decades of research targeting multiprocessors
  - Languages, compilers, architectures, tools
- What is different today?
  1. Multicores vs. multiprocessors. Multicores have:
     - New interconnects with non-uniform communication costs
     - Faster on-chip communication than off-chip I/O, memory ops
     - Limited per-core memory availability
  2. Non-expert programmers
     - Supercomputers with >2048 processors today: 100 [top500.org]
     - Machines with >2048 cores in 2020: >100 million [ITU, Moore]
  3. Application trends
     - Embedded: 2.7 billion cell phones vs. 850 million PCs [ITU 2006]
     - Data-centric: YouTube streams 200 TB of video daily
6 Streaming Application Domain
- For programs based on streams of data
  - Audio, video, DSP, networking, and cryptographic processing kernels
  - Examples: HDTV editing, radar tracking, microphone arrays, cell phone base stations, graphics
[Figure: FM radio stream graph. AtoD -> FMDemod -> Duplicate splitter -> three parallel branches (LPF1 -> HPF1, LPF2 -> HPF2, LPF3 -> HPF3) -> RoundRobin joiner -> Adder -> Speaker]
7 Streaming Application Domain
- For programs based on streams of data
  - Audio, video, DSP, networking, and cryptographic processing kernels
  - Examples: HDTV editing, radar tracking, microphone arrays, cell phone base stations, graphics
- Properties of stream programs
  - Regular and repeating computation
  - Independent filters with explicit communication
  - Data items have short lifetimes
[Figure: FM radio stream graph, as on slide 6]
8 Brief History of Streaming
[Figure: timeline from 1960 to 2000 with three rows]
- Models of Computation: Petri Nets, Comp. Graphs, Kahn Proc. Networks, Communicating Sequential Processes, Synchronous Dataflow
- Modeling Environments: Gabriel, Grape-II, Ptolemy, Matlab/Simulink, etc.
- Languages / Compilers: Lucid, Id, VAL, Occam, Sisal, LUSTRE, Esterel, Erlang, pH, lazy, C
9 Brief History of Streaming
[Figure: timeline from slide 8, annotated with the following]
- Weaknesses
  - Unsuitable for static analysis
  - Cannot leverage deep results from DSP / modeling community
- Strengths
  - Elegance
  - Generality
10 Brief History of Streaming
[Figure: timeline from slide 8, with StreamIt, Brook, StreamC, and Cg added to the Languages / Compilers row under the label "Stream Programming"]
- Weaknesses
  - Unsuitable for static analysis
  - Cannot leverage deep results from DSP / modeling community
- Strengths
  - Elegance
  - Generality
11 StreamIt: A Language and Compiler for Stream Programs
- Key idea: design language that enables static analysis
- Goals
  - Expose and exploit the parallelism in stream programs
  - Improve programmer productivity in the streaming domain
- Project contributions
  - Language design for streaming [CC'02, CAN'02, PPoPP'05, IJPP'05]
  - Automatic parallelization [ASPLOS'02, G.Hardware'05, ASPLOS'06]
  - Domain-specific optimizations [PLDI'03, CASES'05, TechRep'07]
  - Cache-aware scheduling [LCTES'03, LCTES'05]
  - Extracting streams from legacy code [MICRO'07]
  - User application studies [PLDI'05, P-PHEC'05, IPDPS'06]
7 years, 25 people, 300 KLOC; 700 external downloads, 5 external publications
12 StreamIt: A Language and Compiler for Stream Programs
[Slide 11 repeated, with the list of project contributions retitled "I contributed to"]
13 StreamIt: A Language and Compiler for Stream Programs
[Slide 11 repeated, with the list of project contributions retitled "This talk"]
14 Part 1: Language Design
Joint work with Michael Gordon
- William Thies, Michal Karczmarek, Saman Amarasinghe (CC'02)
- William Thies, Michal Karczmarek, Janis Sermulins, Rodric Rabbah, Saman Amarasinghe (PPoPP'05)
15 StreamIt Language Basics
- High-level, architecture-independent language
  - Backend support for uniprocessors, multicores (Raw, SMP), cluster of workstations
- Model of computation: synchronous dataflow [Lee & Messerschmitt, 1987]
  - Program is a graph of independent filters
  - Filters have an atomic execution step with known input / output rates
  - Compiler is responsible for scheduling and buffer management
- Extensions to synchronous dataflow
  - Dynamic I/O rates
  - Support for sliding window operations
  - Teleport messaging [PPoPP'05]
[Figure: Input (push 1, fires x 10) -> Decimate (pop 10, push 1, fires x 1) -> Output (pop 1, fires x 1)]
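The firing counts in the Input / Decimate / Output figure follow from the synchronous dataflow balance equations: along every channel, items pushed per steady state must equal items popped. A minimal sketch in Python (not StreamIt), assuming a straight pipeline; the function name `pipeline_repetitions` and the `(push, pop)` edge encoding are illustrative choices, not part of the StreamIt compiler:

```python
from fractions import Fraction
from math import gcd


def pipeline_repetitions(edges):
    """Smallest positive firing counts for a straight pipeline.

    edges[i] = (push, pop): items pushed per firing by filter i and
    popped per firing by filter i + 1 on the channel between them.
    """
    # Balance equation on each channel: reps[i] * push == reps[i + 1] * pop.
    rates = [Fraction(1)]
    for push, pop in edges:
        rates.append(rates[-1] * push / pop)

    # Scale the rational solution to integers.
    scale = 1
    for r in rates:
        scale = scale * r.denominator // gcd(scale, r.denominator)
    counts = [int(r * scale) for r in rates]

    # Divide out any common factor to get the minimal schedule.
    common = 0
    for c in counts:
        common = gcd(common, c)
    return [c // common for c in counts]


# Input pushes 1; Decimate pops 10 and pushes 1; Output pops 1.
decimate_schedule = pipeline_repetitions([(1, 10), (1, 1)])
print(decimate_schedule)
```

For the rates in the figure this yields 10 firings of Input for each firing of Decimate and Output. Because the rates are known statically, a compiler can compute this schedule and size the buffers before the program runs.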
16 Representing Streams
- Conventional wisdom: stream programs are graphs
  - Graphs have no simple textual representation
  - Graphs are difficult to analyze and optimize
- Insight: stream programs have structure
[Figure label: unstructured]
17 Structured Streams
- Each structure is single-input, single-output
- Hierarchical and composable
[Figure: the four constructs: filter; pipeline (each stage may be any StreamIt language construct); splitjoin (splitter, parallel streams, joiner); feedback loop (joiner, body, splitter)]
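Because every construct is single-input and single-output, constructs nest freely. A rough Python model of that composability, assuming streams are finite lists and that the splitter duplicates while the joiner is round-robin; the helper names (`pipeline`, `splitjoin`, `map_filter`, `adder`) are hypothetical and only mimic the StreamIt constructs:

```python
def map_filter(fn):
    """A stateless filter: pop one item, push one item."""
    return lambda items: [fn(x) for x in items]


def pipeline(*stages):
    """Run single-input, single-output stages one after another."""
    def run(items):
        for stage in stages:
            items = stage(items)
        return items
    return run


def splitjoin(*branches):
    """Duplicate splitter, then a round-robin joiner that takes one
    item from each branch in turn."""
    def run(items):
        outputs = [branch(list(items)) for branch in branches]
        return [item for group in zip(*outputs) for item in group]
    return run


def adder(n):
    """Pop n items, push their sum."""
    def run(items):
        return [sum(items[i:i + n]) for i in range(0, len(items) - n + 1, n)]
    return run


# A pipeline containing a splitjoin, which itself contains filters.
program = pipeline(
    map_filter(lambda x: x + 1),
    splitjoin(map_filter(lambda x: 10 * x), map_filter(lambda x: 100 * x)),
    adder(2),
)
result = program([1, 2, 3])
print(result)
```

Any argument to `pipeline` or `splitjoin` could itself be another pipeline or splitjoin, which is the hierarchical property the slide describes.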
18 Radar-Array Front End
19 Filterbank
20 FFT
21 Block Matrix Multiply
22 MP3 Decoder
23 Bitonic Sort
24 FM Radio with Equalizer
25 Ground Moving Target Indicator (GMTI)
- 99 filters
- 3566 filter instances
26 Example Syntax: FMRadio
    void->void pipeline FMRadio(int N, float lo, float hi) {
      add AtoD();

      add FMDemod();
      add splitjoin {
        split duplicate;
        for (int i=0; i<N; i++) {
          add pipeline {
            add LowPassFilter(lo + i*(hi - lo)/N);
            add HighPassFilter(lo + i*(hi - lo)/N);
          }
        }
        join roundrobin();
      }
      add Adder();
      add Speaker();
    }
[Figure: FM radio stream graph. AtoD -> FMDemod -> Duplicate splitter -> three parallel branches (LPF1 -> HPF1, LPF2 -> HPF2, LPF3 -> HPF3) -> RoundRobin joiner -> Adder -> Speaker]
27 StreamIt Application Suite
- Software radio
- Frequency hopping radio
- Acoustic beam former
- Vocoder
- FFTs and DCTs
- JPEG Encoder/Decoder
- MPEG-2 Encoder/Decoder
- MPEG-4 (fragments)
- Sorting algorithms
- GMTI (Ground Moving Target Indicator)
- DES and Serpent crypto algorithms
- SSCA3 (HPCS scalable benchmark for synthetic aperture radar)
- Mosaic imaging using RANSAC algorithm
Total size: 60,000 lines of code
28 Control Messages
- Occasionally, low-bandwidth control messages are sent between actors
- Often demands precise timing
  - Communications: adjust protocol, amplification, compression
  - Network router: cancel invalid packet
  - Adaptive beamformer: track a target
  - Respond to user input, runtime errors
  - Frequency hopping radio
- Traditional techniques
  - Direct method call (no timing guarantees)
  - Embed message in stream (opaque, slow)
[Figure: stream graph. AtoD -> Decode -> duplicate splitter -> three parallel branches (LPF1 -> HPF1, LPF2 -> HPF2, LPF3 -> HPF3) -> roundrobin joiner -> Encode -> Transmit]
29 Idea 2: Teleport Messaging
- Looks like method call, but timed relative to data in the stream
- Exposes dependences to compiler
- Simple and precise for user
  - Adjustable latency
  - Can send upstream or downstream
    TargetFilter x;
    if newProtocol(p)
      x.setProtocol(p) @ 2;

    void setProtocol(int p) {
      reconfig(p);
    }
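One simplified reading of "timed relative to data in the stream" can be simulated in a few lines of Python. This sketch assumes a two-filter pipeline in which both filters pop one item and push one item, so that a message sent on sender firing n with latency k is handled just before receiver firing n + k. The function `run_pipeline` and its parameters are hypothetical; the real semantics for mismatched rates and upstream messages are more involved.

```python
def run_pipeline(items, should_send, new_gain, latency):
    """Sender forwards items unchanged; receiver multiplies by a gain.

    When should_send(item) holds on sender firing n, the sender teleports
    "set gain to new_gain" so that it takes effect just before the
    receiver's firing n + latency.
    """
    pending = {}  # receiver firing index -> gain to install
    forwarded = []
    for n, item in enumerate(items):  # sender firings
        if should_send(item):
            pending[n + latency] = new_gain
        forwarded.append(item)

    gain = 1
    out = []
    for n, item in enumerate(forwarded):  # receiver firings
        if n in pending:
            gain = pending[n]  # handler runs before this firing
        out.append(gain * item)
    return out


# The sender sees the value 2 on firing 1 and sends with latency 2,
# so the new gain first applies on receiver firing 3 (the value 4).
result = run_pipeline([1, 2, 3, 4, 5, 6], lambda x: x == 2, 10, latency=2)
print(result)
```

The delivery point depends only on firing counts, not on wall-clock time or buffering, which is what lets a compiler reorder and parallelize the filters without changing when the message takes effect.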
30 Part 2: Automatic Parallelization
Joint work with Michael Gordon
- Michael I. Gordon, William Thies, Saman Amarasinghe (ASPLOS'06)
- Michael I. Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali S. Meli, Andrew A. Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman Amarasinghe (ASPLOS'02)
31 Streaming is an Implicitly Parallel Model
- Programmer thinks about functionality, not parallelism
- More explicit models may
  - Require knowledge of target [MPI] [cG]
  - Require parallelism annotations [OpenMP] [HPF] [Cilk] [Intel TBB]
- Novelty over other implicit models? [Erlang] [MapReduce] [Sequoia] [pH] [Occam] [Sisal] [Id] [VAL] [LUSTRE] [HAL] [THAL] [SALSA] [Rosette] [ABCL] [APL] [ZPL] [NESL]
  - Exploiting streaming structure for robust performance
32 Parallelism in Stream Programs
- Task parallelism
  - Analogous to thread (fork/join) parallelism
- Data parallelism
  - Peel iterations of filter, place within scatter/gather pair (fission)
  - Cannot parallelize filters with state
- Pipeline parallelism
  - Between producers and consumers
  - Stateful filters can be parallelized
33 Parallelism in Stream Programs
- Task parallelism
  - Analogous to thread (fork/join) parallelism
- Data parallelism
  - Analogous to DOALL loops
- Pipeline parallelism
  - Analogous to ILP that is exploited in hardware
[Figure: stream graph with splitter/joiner pairs, labeling task parallelism between the branches of a splitjoin, data parallelism on a stateless filter, and pipeline parallelism along a pipeline]
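Fission of a data-parallel filter can be illustrated with a small Python model: scatter items round-robin to k replicas, run each replica independently, and gather round-robin. The sketch assumes filters that pop one item and push one item; `fiss`, `make_scale`, and `make_running_sum` are illustrative names, not StreamIt APIs.

```python
def run_sequential(make_work, items):
    """Reference behavior: one filter instance sees every item in order."""
    work = make_work()
    return [work(x) for x in items]


def fiss(make_work, k, items):
    """Round-robin scatter to k replicas, then round-robin gather."""
    lanes = [items[j::k] for j in range(k)]
    results = []
    for lane in lanes:
        work = make_work()  # each replica has its own private instance
        results.append([work(x) for x in lane])
    return [results[i % k][i // k] for i in range(len(items))]


def make_scale():
    """Stateless filter: output depends only on the current input."""
    return lambda x: 2 * x


def make_running_sum():
    """Stateful filter: output depends on every earlier input."""
    total = 0

    def work(x):
        nonlocal total
        total += x
        return total
    return work


data = [1, 2, 3, 4, 5, 6]
stateless_matches = fiss(make_scale, 2, data) == run_sequential(make_scale, data)
stateful_matches = fiss(make_running_sum, 2, data) == run_sequential(make_running_sum, data)
print(stateless_matches, stateful_matches)
```

The stateless filter gives identical output after fission, while the running sum does not, because each replica only sees its own share of the stream. Stateful filters are therefore left to pipeline parallelism.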
34 Baseline: Fine-Grained Data Parallelism
[Figure: stream graph in which every filter (BandPass, Compress, Process, Expand, BandStop, Adder) is fissed into multiple copies, each group wrapped in its own splitter/joiner pair]
35 Evaluation: Fine-Grained Data Parallelism
36 Evaluation: Fine-Grained Data Parallelism
Good Parallelism! Too Much Synchronization!
37 Coarsening the Granularity
[Figure: original stream graph. Splitter -> two branches (BandPass -> Compress -> Process -> Expand -> BandStop) -> Joiner -> Adder]
38 Coarsening the Granularity
39 Coarsening the Granularity
[Figure: in each branch, BandPass, Compress, Process, and Expand are fused into one coarse-grained filter, which is then fissed across a splitter/joiner pair; BandStop and Adder are unchanged]
40 Coarsening the Granularity
[Figure: as on slide 39, with BandStop and Adder also fissed across splitter/joiner pairs]
41 Evaluation: Coarse-Grained Data Parallelism
Good Parallelism! Low Synchronization!
42 Simplified Vocoder
[Figure: stream graph annotated with a number per filter. Splitter -> two filters (6 each) -> Joiner -> RectPolar (20, data parallel) -> Splitter -> two branches (Unwrap 2 -> Diff 1 -> Amplify 1 -> Accum 1) -> Joiner -> PolarRect (20, data parallel)]
Target a 4-core machine
43 Data Parallelize
[Figure: the vocoder graph from slide 42 after data parallelization. RectPolar (20) and PolarRect (20) are each fissed into copies of 5 between a splitter and a joiner; the rest of the graph is unchanged]
Target a 4-core machine
44 Data + Task Parallel Execution
[Figure: schedule of cores versus time; execution time 21]
Target a 4-core machine
45 We Can Do Better
[Figure: schedule of cores versus time; execution time 16]
Target a 4-core machine
46 Coarse-Grained Software Pipelining
[Figure: schedule showing a Prologue followed by the New Steady State]
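The times 21 and 16 on the preceding slides can be reproduced with a back-of-the-envelope model in Python. This is a sketch under stated assumptions, not the compiler's algorithm: the numbers in the vocoder figure are read as per-filter work estimates, RectPolar and PolarRect are assumed to be fissed four ways, data + task parallel execution is modeled as stages separated by barriers, and software pipelining is modeled as greedy longest-first packing of all filters onto the cores. The function names are hypothetical.

```python
def barrier_time(stages):
    """Data + task parallel: stages run one after another, and each
    stage takes as long as its slowest task."""
    return sum(max(stage) for stage in stages)


def packed_time(tasks, cores):
    """Software pipelined steady state: filters from different
    iterations are independent, so pack them greedily, longest first."""
    loads = [0] * cores
    for work in sorted(tasks, reverse=True):
        loads[loads.index(min(loads))] += work
    return max(loads)


branch = [2, 1, 1, 1]  # Unwrap, Diff, Amplify, Accum

stages = [
    [6, 6],                      # two filters after the first splitter
    [5, 5, 5, 5],                # RectPolar (20) fissed four ways
    [sum(branch), sum(branch)],  # the two branches run as parallel tasks
    [5, 5, 5, 5],                # PolarRect (20) fissed four ways
]
all_filters = [6, 6] + [5] * 4 + branch * 2 + [5] * 4

data_task = barrier_time(stages)
pipelined = packed_time(all_filters, cores=4)
print(data_task, pipelined)
```

Under these assumptions the total work is 62 units, so 16 is the best possible on four cores, while the barrier-separated schedule leaves two cores idle during the stages that have only two tasks.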
47 Evaluation: Coarse-Grained Task + Data + Software Pipelining
48 Evaluation: Coarse-Grained Task + Data + Software Pipelining
Best Parallelism! Lowest Synchronization!
49 Parallelism: Take Away
- Stream programs have abundant parallelism
  - However, parallelism is obfuscated in languages like C
- Stream languages enable a new, effective mapping
  - In C, analogous transformations are impossibly complex
  - In StreamC or Brook, similar transformations are possible [Khailany et al., IEEE Micro'01] [Buck et al., SIGGRAPH'04] [Das et al., PACT'06]
- Results should extend to other multicores
  - Parameters: local memory, comm.-to-comp. cost
  - Preliminary results on Cell are promising [Zhang, dasCMP'07]
[Figure: the mapping steps: Coarsen Granularity, Data Parallelize, Software Pipeline]
50 Part 3: Domain-Specific Optimizations
Joint work with Andrew Lamb, Sitij Agrawal
- Andrew Lamb, William Thies, Saman Amarasinghe (PLDI'03)
- Sitij Agrawal, William Thies, Saman Amarasinghe (CASES'05)
51 DSP Optimization Process
- Given specification of algorithm, minimize the computation cost
[Figure: FM radio stream graph (AtoD -> FMDemod -> Duplicate -> LPF1..3 / HPF1..3 -> RoundRobin -> Adder -> Speaker) with a region labeled Linear]
52 DSP Optimization Process
- Given specification of algorithm, minimize the computation cost
[Figure: AtoD -> FMDemod -> Equalizer -> Speaker]
53 DSP Optimization Process
- Given specification of algorithm, minimize the computation cost
[Figure: AtoD -> FMDemod -> Equalizer, with FFT and IFFT added]
54 DSP Optimization Process
- Given specification of algorithm, minimize the computation cost
  - Currently done by hand (MATLAB)
- Can compiler replace DSP expert?
  - Library generators limited [Spiral] [FFTW] [ATLAS]
  - Enable unified development environment
[Figure: FMDemod -> FFT -> Equalizer -> IFFT]
55 Focus: Linear State Space Filters
- Properties
  - Outputs are linear function of inputs and states
  - New states are linear function of inputs and states
- Most common target of DSP optimizations
  - FIR / IIR filters
  - Linear difference equations
  - Upsamplers / downsamplers
  - DCTs
[Figure: filter with inputs $u$, states $x' = Ax + Bu$, and outputs $y = Cx + Du$]
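The state-space form can be made concrete with a tiny Python simulation. This sketch assumes each firing computes the output from the current state and then updates the state; the helper `make_state_space` and the accumulator example are illustrative and not taken from the talk.

```python
def mat_vec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * e for m, e in zip(row, v)) for row in M]


def make_state_space(A, B, C, D, x0):
    """Return a step function implementing y = Cx + Du, x' = Ax + Bu."""
    x = list(x0)

    def step(u):
        nonlocal x
        y = [a + b for a, b in zip(mat_vec(C, x), mat_vec(D, u))]
        x = [a + b for a, b in zip(mat_vec(A, x), mat_vec(B, u))]
        return y
    return step


# A running-sum (accumulator) filter with one state variable:
# y = x + u and x' = x + u, so A = B = C = D = [[1]].
accumulate = make_state_space([[1]], [[1]], [[1]], [[1]], x0=[0])
outputs = [accumulate([u]) for u in (1, 2, 3)]
print(outputs)
```

A filter with no state variables is the special case in which only D is non-empty, which is the linear filter case treated on the following slides.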
56 Focus: Linear State Space Filters
[Figure: filter with inputs $u$, states $x' = Ax + Bu$, and outputs $y = Cx + Du$]
57 Focus: Linear Filters
    float->float filter Scale {
      work push 2 pop 1 {
        float u = pop();
        push(u);
        push(2*u);
      }
    }
Linear dataflow analysis
[Figure: filter with inputs $u$ and outputs $y = Du$]
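58 Focus: Linear Filters
    float->float filter Scale {
      work push 2 pop 1 {
        float u = pop();
        push(u);
        push(2*u);
      }
    }
Linear dataflow analysis
[Figure: filter with inputs $u$ and outputs $\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} u$]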
59 Combining Adjacent Filters
[Figure: $u$ -> Filter 1 ($y = Du$) -> $y$ -> Filter 2 ($z = Ey$) -> $z$]
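Substituting $y = Du$ into $z = Ey$ gives $z = (ED)u$, so two adjacent linear filters collapse into one filter whose matrix is the product. A small Python check of that identity, using the Scale filter's matrix for D and a made-up second filter E that pops two items and pushes 3 times the first plus 4 times the second (E is purely hypothetical):

```python
def mat_mul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [
        [sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
        for i in range(len(X))
    ]


def apply(M, v):
    """Apply matrix M to column vector v."""
    return [sum(m * e for m, e in zip(row, v)) for row in M]


D = [[1], [2]]  # Scale: pop u, push u and 2u
E = [[3, 4]]    # hypothetical: pop y1, y2, push 3*y1 + 4*y2

ED = mat_mul(E, D)  # matrix of the combined filter

u = [5]
two_steps = apply(E, apply(D, u))  # run Filter 1, then Filter 2
one_step = apply(ED, u)            # run the combined filter
print(ED, two_steps, one_step)
```

Here the pair costs four multiplications per input, while the combined filter costs one, which is the kind of saving the optimization is after. This example only works directly because Filter 1 pushes exactly as many items as Filter 2 pops.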
60 Combination Example
[Figure: Filter 1 ($u$ -> $y$) followed by Filter 2 ($y$ -> $z$) is replaced by a single Combined Filter ($u$ -> $z$)]
61 The General Case
- If matrix dimensions mismatch? Matrix expansion
[Figure: original versus expanded matrices for filters D and E applied to input U]
62 The General Case
- If matrix dimensions mismatch? Matrix expansion
[Figure: as on slide 61]
63 The General Case
Pipelines
Feedback Loops
64 The General Case
Splitjoins
65 Floating-Point Operations Reduction
[Chart annotation: 0.3]
66 Floating-Point Operations Reduction
[Chart annotations: 0.3, -140]
67 Radar (Transformation Selection)
68 Radar (Transformation Selection)
69 Radar (Transformation Selection)
70 Radar (Transformation Selection)
Using Transformation Selection
71 Floating-Point Operations Reduction
[Chart annotations: 0.3, -140]
72 Execution Speedup
[Chart annotation: 5]
On a Pentium IV
73 Execution Speedup
Additional transformations:
1. Eliminating redundant states
2. Eliminating parameters (non-zero, non-unary coefficients)
3. Translation to the compressed domain
[Chart annotation: 5]
On a Pentium IV
74 StreamIt: Lessons Learned
- In practice, I/O rates of filters are often matched [LCTES'03]
  - Over 30 publications study an uncommon case (CD-DAT)
- Multi-phase filters complicate programs, compilers
  - Should maintain simplicity of only one atomic step per filter
- Programmers accidentally introduce mutable filter state
[Figure: CD-DAT pipeline of four filters with rates 1 -> 2, 3 -> 2, 7 -> 8, 7 -> 5 and multiplicities x 147, x 98, x 28, x 32]
stateful:
    void->int filter SquareWave() {
      int x = 0;
      work push 1 {
        push(x);
        x = 1 - x;
      }
    }
stateless:
    void->int filter SquareWave() {
      work push 2 {
        push(0);
        push(1);
      }
    }
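The two SquareWave versions produce the same stream, which can be checked with Python generators standing in for the filters. This is only a model of the two work functions; the generator names are illustrative.

```python
from itertools import islice


def square_wave_stateful():
    """Mirrors the stateful version: one push per firing, state x flips."""
    x = 0
    while True:
        yield x
        x = 1 - x


def square_wave_stateless():
    """Mirrors the stateless version: two pushes per firing, no state."""
    while True:
        yield 0
        yield 1


stateful_output = list(islice(square_wave_stateful(), 8))
stateless_output = list(islice(square_wave_stateless(), 8))
print(stateful_output, stateless_output)
```

The outputs are identical, but only the second form carries no state between firings, so only it is eligible for data-parallel fission.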
75 Future of StreamIt
- Goal: influence the next big language
[Figure: Origins of C++, a timeline from 1960 to 1990 with a legend for structural influence, feature influence, and academic origin. Languages shown: Fortran, Algol 60, CPL, Simula 67, BCPL, C, ML, Algol 68, Clu, C with Classes, Ada, C++, ANSI C, C++arm, C++std. Source: B. Stroustrup, The Design and Evolution of C++]
76 Research Trajectory
- Vision: Make emerging computational substrates universally accessible and useful
  1. Languages, compilers, and tools for multicores
     - I believe new language / compiler technology can enable scalable and robust performance
     - Next inroads: expose and exploit flexibility in programs
  2. Programmable microfluidics
     - We have developed programming languages, tools, and flexible new devices for microfluidics
     - Potential to revolutionize biology experimentation
  3. Technologies for the developing world
     - TEK: enable Internet experience over email account
     - Audio Wiki: publish content from a low-cost phone
     - uBox / uPhone: monitor and improve rural healthcare
77 Conclusions
- A parallel programming model will succeed only by luring programmers, making them do less, not more
- Stream programming lures programmers with
  - Elegant programming primitives
  - Domain-specific optimizations
- Meanwhile, streaming is implicitly parallel
  - Robust performance via task, data, and pipeline parallelism
- We believe stream programming will play a key role in enabling a transition to multicore processors
- Contributions
  - Structured streams
  - Teleport messaging
  - Unified algorithm for task, data, and pipeline parallelism
  - Software pipelining of whole procedures
  - Algebraic simplification of whole procedures
  - Translation from time to frequency
  - Selection of best DSP transforms
78 Acknowledgments
- Project supervisors
  - Prof. Saman Amarasinghe, Dr. Rodric Rabbah
- Contributors to this talk
  - Michael I. Gordon (Ph.D. Candidate): leads StreamIt backend efforts
  - Andrew A. Lamb (M.Eng): led linear optimizations
  - Sitij Agrawal (M.Eng): led statespace optimizations
- Compiler developers
  - Kunal Agrawal, Allyn Dimock, Qiuyuan Jimmy Li, Jasper Lin, Michal Karczmarek, David Maze, Janis Sermulins, Phil Sung, David Zhang
- Application developers
  - Basier Aziz, Matthew Brown, Matthew Drake, Shirley Fung, Hank Hoffmann, Chris Leger, Ali Meli, Satish Ramaswamy, Jeremy Wong
- User interface developers
  - Kimberly Kuo, Juan Reyes