Stream Programming: Luring Programmers into the Multicore Era (Bill Thies)

1
Stream Programming: Luring Programmers into the Multicore Era
Bill Thies
  • Computer Science and Artificial Intelligence
    Laboratory
  • Massachusetts Institute of Technology
  • Spring 2008

2
Multicores are Here
[Figure: number of cores per chip (axis ticks 1, 2, 4, ..., 512) versus year (1970 to 20??). Plotted processors: 4004, 8008, 8080, 8086, 286, 386, 486, Pentium, P2, P3, P4, Athlon, Itanium, Itanium 2, Power4, Power6, PExtreme, Yonah, Tanglewood, Opteron, PA-8800, Xeon MP, Xbox360, Opteron 4P, Broadcom 1480, Niagara, Cell, Raw, Cavium Octeon, Raza XLR, Tilera, Intel Tflops, Cisco CSR-1, Ambric AM2045, Picochip PC102]
3
Multicores are Here
[Figure: the cores-versus-year chart from the previous slide, annotated: "Hardware was responsible for improving performance"]
4
Multicores are Here
[Figure: the cores-versus-year chart from the previous slide, annotated: "Now, performance burden falls on programmers"]
5
Is Parallel Programming a New Problem?
  • No! Decades of research targeting multiprocessors
    - Languages, compilers, architectures, tools
  • What is different today?
  • 1. Multicores vs. multiprocessors. Multicores have:
    - New interconnects with non-uniform communication costs
    - Faster on-chip communication than off-chip I/O, memory ops
    - Limited per-core memory availability
  • 2. Non-expert programmers
    - Supercomputers with >2048 processors today: 100 [top500.org]
    - Machines with >2048 cores in 2020: >100 million [ITU, Moore]
  • 3. Application trends
    - Embedded: 2.7 billion cell phones vs. 850 million PCs [ITU 2006]
    - Data-centric: YouTube streams 200 TB of video daily

6
Streaming Application Domain
  • For programs based on streams of data
  • Audio, video, DSP, networking, and cryptographic processing kernels
  • Examples: HDTV editing, radar tracking, microphone arrays, cell phone base stations, graphics

[Figure: FM radio stream graph: AtoD → FMDemod → Duplicate → three parallel branches (LPF1 → HPF1, LPF2 → HPF2, LPF3 → HPF3) → RoundRobin → Adder → Speaker]
7
Streaming Application Domain
  • For programs based on streams of data
  • Audio, video, DSP, networking, and cryptographic processing kernels
  • Examples: HDTV editing, radar tracking, microphone arrays, cell phone base stations, graphics
  • Properties of stream programs
    - Regular and repeating computation
    - Independent filters with explicit communication
    - Data items have short lifetimes

[Figure: FM radio stream graph: AtoD → FMDemod → Duplicate → three parallel branches (LPF1 → HPF1, LPF2 → HPF2, LPF3 → HPF3) → RoundRobin → Adder → Speaker]
8
Brief History of Streaming
[Figure: timeline from 1960 to 2000 in three rows]
  • Models of Computation: Kahn Proc. Networks, Synchronous Dataflow, Petri Nets, Comp. Graphs, Communicating Sequential Processes
  • Modeling Environments: Ptolemy, Matlab/Simulink, Gabriel, Grape-II, etc.
  • Languages / Compilers: Sisal, Esterel, Erlang, Lucid, Id, pH, Occam, VAL, lazy, LUSTRE, C
9
Brief History of Streaming
[Figure: timeline from 1960 to 2000 in three rows]
  • Models of Computation: Kahn Proc. Networks, Synchronous Dataflow, Petri Nets, Comp. Graphs, Communicating Sequential Processes
  • Modeling Environments: Ptolemy, Matlab/Simulink, Gabriel, Grape-II, etc.
  • Languages / Compilers: Sisal, Esterel, Erlang, Lucid, Id, pH, Occam, VAL, lazy, LUSTRE, C

  • Weaknesses
    - Unsuitable for static analysis
    - Cannot leverage deep results from DSP / modeling community
  • Strengths
    - Elegance
    - Generality

10
Brief History of Streaming
[Figure: timeline from 1960 to 2000 in three rows]
  • Models of Computation: Kahn Proc. Networks, Synchronous Dataflow, Petri Nets, Comp. Graphs, Communicating Sequential Processes
  • Modeling Environments: Ptolemy, Matlab/Simulink, Gabriel, Grape-II, etc.
  • Languages / Compilers: Sisal, Esterel, Erlang, Lucid, Id, pH, Occam, VAL, lazy, LUSTRE, C
  • Stream Programming (added to the Languages / Compilers row): StreamIt, Brook, StreamC, Cg

  • Weaknesses
    - Unsuitable for static analysis
    - Cannot leverage deep results from DSP / modeling community
  • Strengths
    - Elegance
    - Generality
11
StreamIt: A Language and Compiler for Stream Programs
  • Key idea: design language that enables static analysis
  • Goals
    - Expose and exploit the parallelism in stream programs
    - Improve programmer productivity in the streaming domain
  • Project contributions
    - Language design for streaming [CC'02, CAN'02, PPoPP'05, IJPP'05]
    - Automatic parallelization [ASPLOS'02, G.Hardware'05, ASPLOS'06]
    - Domain-specific optimizations [PLDI'03, CASES'05, TechRep'07]
    - Cache-aware scheduling [LCTES'03, LCTES'05]
    - Extracting streams from legacy code [MICRO'07]
    - User application studies [PLDI'05, P-PHEC'05, IPDPS'06]

7 years, 25 people, 300 KLOC; 700 external downloads, 5 external publications
12
StreamIt: A Language and Compiler for Stream Programs
  • Key idea: design language that enables static analysis
  • Goals
    - Expose and exploit the parallelism in stream programs
    - Improve programmer productivity in the streaming domain
  • I contributed to
    - Language design for streaming [CC'02, CAN'02, PPoPP'05, IJPP'05]
    - Automatic parallelization [ASPLOS'02, G.Hardware'05, ASPLOS'06]
    - Domain-specific optimizations [PLDI'03, CASES'05, TechRep'07]
    - Cache-aware scheduling [LCTES'03, LCTES'05]
    - Extracting streams from legacy code [MICRO'07]
    - User application studies [PLDI'05, P-PHEC'05, IPDPS'06]

7 years, 25 people, 300 KLOC; 700 external downloads, 5 external publications
13
StreamIt: A Language and Compiler for Stream Programs
  • Key idea: design language that enables static analysis
  • Goals
    - Expose and exploit the parallelism in stream programs
    - Improve programmer productivity in the streaming domain
  • This talk
    - Language design for streaming [CC'02, CAN'02, PPoPP'05, IJPP'05]
    - Automatic parallelization [ASPLOS'02, G.Hardware'05, ASPLOS'06]
    - Domain-specific optimizations [PLDI'03, CASES'05, TechRep'07]
    - Cache-aware scheduling [LCTES'03, LCTES'05]
    - Extracting streams from legacy code [MICRO'07]
    - User application studies [PLDI'05, P-PHEC'05, IPDPS'06]

7 years, 25 people, 300 KLOC; 700 external downloads, 5 external publications
14
Part 1: Language Design
Joint work with Michael Gordon
  • William Thies, Michal Karczmarek, Saman Amarasinghe (CC'02)
  • William Thies, Michal Karczmarek, Janis Sermulins, Rodric Rabbah, Saman Amarasinghe (PPoPP'05)

15
StreamIt Language Basics
  • High-level, architecture-independent language
    - Backend support for uniprocessors, multicores (Raw, SMP), cluster of workstations
  • Model of computation: synchronous dataflow [Lee & Messerschmitt, 1987]
    - Program is a graph of independent filters
    - Filters have an atomic execution step with known input / output rates
    - Compiler is responsible for scheduling and buffer management
  • Extensions to synchronous dataflow
    - Dynamic I/O rates
    - Support for sliding window operations
    - Teleport messaging [PPoPP'05]

[Figure: synchronous dataflow example: Input (push 1, fires ×10) → Decimate (pop 10, push 1, fires ×1) → Output (pop 1, fires ×1)]
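
Because every filter declares fixed input and output rates, a compiler can solve for how often each filter must fire per steady-state cycle so that buffers neither grow nor drain. The Python sketch below illustrates that balance computation for a linear pipeline only. The function name `steady_state_repetitions` and the `(pop, push)` tuple encoding are assumptions made for this illustration, not part of StreamIt.

```python
from fractions import Fraction
from math import lcm  # available in Python 3.9+


def steady_state_repetitions(rates):
    """Return the smallest positive firing counts that balance a pipeline.

    `rates` is a list of (pop, push) pairs, one per filter, in pipeline
    order (a hypothetical encoding chosen for this sketch). The first
    filter's pop and the last filter's push are ignored.
    """
    # Relative firing rate of each filter, with the first filter fixed at 1.
    # Balance on each channel: reps[i] * push[i] == reps[i + 1] * pop[i + 1].
    relative = [Fraction(1)]
    for (_, push_prev), (pop_next, _) in zip(rates, rates[1:]):
        relative.append(relative[-1] * push_prev / pop_next)
    # Scale by the least common denominator to obtain integers.
    scale = lcm(*(r.denominator for r in relative))
    return [int(r * scale) for r in relative]


# Input pushes 1 item per firing, Decimate pops 10 and pushes 1,
# Output pops 1 item per firing.
print(steady_state_repetitions([(0, 1), (10, 1), (1, 0)]))  # [10, 1, 1]
```

The result matches the multiplicities in the Input / Decimate / Output example: ten firings of Input feed one firing each of Decimate and Output.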
16
Representing Streams
  • Conventional wisdom: stream programs are graphs
    - Graphs have no simple textual representation
    - Graphs are difficult to analyze and optimize
  • Insight: stream programs have structure

[Figure: a stream graph labeled "unstructured"]
17
Structured Streams
  • Each structure is single-input, single-output
  • Hierarchical and composable

[Figure: the StreamIt structures: filter; pipeline (each stage may be any StreamIt language construct); splitjoin (splitter ... joiner); feedback loop (joiner ... splitter)]
18
Radar-Array Front End
19
Filterbank
20
FFT
21
Block Matrix Multiply
22
MP3 Decoder
23
Bitonic Sort
24
FM Radio with Equalizer
25
Ground Moving Target Indicator (GMTI)
  • 99 filters
  • 3566 filter instances

26
Example Syntax: FMRadio

    void->void pipeline FMRadio(int N, float lo, float hi) {
        add AtoD();
        add FMDemod();
        add splitjoin {
            split duplicate;
            for (int i=0; i<N; i++) {
                add pipeline {
                    add LowPassFilter(lo + i*(hi - lo)/N);
                    add HighPassFilter(lo + i*(hi - lo)/N);
                }
            }
            join roundrobin();
        }
        add Adder();
        add Speaker();
    }

[Figure: the corresponding stream graph: AtoD → FMDemod → Duplicate → (LPF1 → HPF1, LPF2 → HPF2, LPF3 → HPF3) → RoundRobin → Adder → Speaker]
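
To make the splitjoin semantics concrete, here is a small Python sketch of how a duplicate splitter, parallel pipelines, and a round-robin joiner compose. It is a toy model under stated assumptions: streams are finite lists, the helper names (`pipeline`, `splitjoin_duplicate_roundrobin`, `scale`, `offset`, `adder`) are hypothetical, and the per-band arithmetic stands in for real low-pass and high-pass filters.

```python
def pipeline(*stages):
    """Compose stages so that the output of one feeds the next."""
    def run(items):
        for stage in stages:
            items = stage(items)
        return items
    return run


def splitjoin_duplicate_roundrobin(*branches):
    """Duplicate splitter: every branch receives a copy of the input.
    Round-robin joiner: take one item from each branch in turn."""
    def run(items):
        outputs = [branch(list(items)) for branch in branches]
        joined = []
        for group in zip(*outputs):
            joined.extend(group)
        return joined
    return run


def scale(k):
    return lambda items: [k * x for x in items]


def offset(c):
    return lambda items: [x + c for x in items]


def adder(n):
    """Sum each group of n consecutive items (one item per band)."""
    return lambda items: [sum(items[i:i + n])
                          for i in range(0, len(items) - n + 1, n)]


def build_toy_radio(n):
    # One two-stage pipeline per band, mirroring the loop in FMRadio.
    bands = [pipeline(scale(i + 1), offset(0.5)) for i in range(n)]
    return pipeline(splitjoin_duplicate_roundrobin(*bands), adder(n))


print(build_toy_radio(3)([1.0, 2.0]))  # [7.5, 13.5]
```

Each structure is a function from one input stream to one output stream, which is what lets pipelines and splitjoins nest inside each other freely.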
27
StreamIt Application Suite
  • Software radio
  • Frequency hopping radio
  • Acoustic beam former
  • Vocoder
  • FFTs and DCTs
  • JPEG Encoder/Decoder
  • MPEG-2 Encoder/Decoder
  • MPEG-4 (fragments)
  • Sorting algorithms
  • GMTI (Ground Moving Target Indicator)
  • DES and Serpent crypto algorithms
  • SSCA3 (HPCS scalable benchmark for synthetic
    aperture radar)
  • Mosaic imaging using RANSAC algorithm

Total size: 60,000 lines of code
28
Control Messages
  • Occasionally, low-bandwidth control messages are sent between actors
  • Often demands precise timing
    - Communications: adjust protocol, amplification, compression
    - Network router: cancel invalid packet
    - Adaptive beamformer: track a target
    - Respond to user input, runtime errors
    - Frequency hopping radio
  • Traditional techniques
    - Direct method call (no timing guarantees)
    - Embed message in stream (opaque, slow)

[Figure: stream graph: AtoD → Decode → duplicate → (LPF1 → HPF1, LPF2 → HPF2, LPF3 → HPF3) → roundrobin → Encode → Transmit]
29
Idea 2: Teleport Messaging
  • Looks like method call, but timed relative to data in the stream
  • Exposes dependences to compiler
  • Simple and precise for user
    - Adjustable latency
    - Can send upstream or downstream

    TargetFilter x;
    if newProtocol(p)
        x.setProtocol(p) @ 2;

    void setProtocol(int p) {
        reconfig(p);
    }
30
Part 2: Automatic Parallelization
Joint work with Michael Gordon
  • Michael I. Gordon, William Thies, Saman Amarasinghe (ASPLOS'06)
  • Michael I. Gordon, William Thies, Michal Karczmarek, Jasper Lin, Ali S. Meli, Andrew A. Lamb, Chris Leger, Jeremy Wong, Henry Hoffmann, David Maze, Saman Amarasinghe (ASPLOS'02)

31
Streaming is an Implicitly Parallel Model
  • Programmer thinks about functionality, not parallelism
  • More explicit models may
    - Require knowledge of target: MPI, cG
    - Require parallelism annotations: OpenMP, HPF, Cilk, Intel TBB
  • Novelty over other implicit models? Erlang, MapReduce, Sequoia, pH, Occam, Sisal, Id, VAL, LUSTRE, HAL, THAL, SALSA, Rosette, ABCL, APL, ZPL, NESL

→ Exploiting streaming structure for robust performance
32
Parallelism in Stream Programs
  • Task parallelism
    - Analogous to thread (fork/join) parallelism
  • Data parallelism
    - Peel iterations of filter, place within scatter/gather pair (fission)
    - Cannot parallelize filters with state
  • Pipeline parallelism
    - Between producers and consumers
    - Stateful filters can be parallelized

33
Parallelism in Stream Programs
  • Task parallelism
    - Analogous to thread (fork/join) parallelism
  • Data parallelism
    - Analogous to DOALL loops
  • Pipeline parallelism
    - Analogous to ILP that is exploited in hardware

[Figure: stream graph with nested Splitter/Joiner pairs and a Stateless filter, with regions labeled Task, Data, and Pipeline]
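
Data parallelism by fission can be sketched in Python as follows. This is a minimal model, not the StreamIt compiler's algorithm: the names `run_sequential`, `run_fissed`, `make_average`, and `make_running_sum` are hypothetical, and streams are finite lists. Each replica gets its own copy of the filter, which is why fission preserves the output of a stateless filter but not of a stateful one.

```python
def run_sequential(make_work, pop, items):
    """Fire one filter instance on successive chunks of `pop` items."""
    work = make_work()
    out = []
    for i in range(0, len(items) - pop + 1, pop):
        out.extend(work(items[i:i + pop]))
    return out


def run_fissed(make_work, pop, ways, items):
    """Scatter chunks round-robin over `ways` replicas, then gather."""
    chunks = [items[i:i + pop] for i in range(0, len(items) - pop + 1, pop)]
    replicas = [make_work() for _ in range(ways)]
    # Scatter: chunk k goes to replica k % ways. The replicas are
    # independent, so each inner list could be computed on its own core.
    results = [[replicas[r](c) for c in chunks[r::ways]]
               for r in range(ways)]
    # Gather: read the results back in the original chunk order.
    out = []
    for k in range(len(chunks)):
        out.extend(results[k % ways][k // ways])
    return out


def make_average():
    """Stateless filter: pops 2 items, pushes their mean."""
    return lambda chunk: [sum(chunk) / len(chunk)]


def make_running_sum():
    """Stateful filter: pops 1 item, pushes the running total."""
    total = 0
    def work(chunk):
        nonlocal total
        total += chunk[0]
        return [total]
    return work


data = list(range(12))
print(run_fissed(make_average, 2, 4, data) ==
      run_sequential(make_average, 2, data))        # True
print(run_sequential(make_running_sum, 1, [1, 2, 3, 4, 5, 6]))
print(run_fissed(make_running_sum, 1, 2, [1, 2, 3, 4, 5, 6]))
```

The last two lines print different streams, because each replica of the running sum only sees every other item.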
34
Baseline: Fine-Grained Data Parallelism
[Figure: stream graph in which the BandPass, Compress, Process, Expand, BandStop, and Adder filters are each replicated between their own Splitter/Joiner pair]
35
Evaluation: Fine-Grained Data Parallelism
36
Evaluation: Fine-Grained Data Parallelism
Good Parallelism! Too Much Synchronization!
37
Coarsening the Granularity
[Figure: original stream graph: Splitter → two parallel pipelines (BandPass → Compress → Process → Expand → BandStop) → Joiner → Adder]
38
Coarsening the Granularity
39
Coarsening the Granularity
[Figure: coarsened graph: in each branch, BandPass, Compress, Process, and Expand are fused into a single filter, which is then replicated between a Splitter/Joiner pair; the two BandStop filters, the outer Joiner, and the Adder follow unchanged]
40
Coarsening the Granularity
[Figure: coarsened graph as on the previous slide, with the BandStop and Adder filters now also replicated between Splitter/Joiner pairs]
41
Evaluation: Coarse-Grained Data Parallelism
Good Parallelism! Low Synchronization!
42
Simplified Vocoder
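[Figure: simplified vocoder stream graph with a numeric annotation on each filter: Splitter → two filters (6 each) → Joiner → RectPolar (20, marked Data Parallel) → Splitter → two branches of Unwrap (2) → Diff (1) → Amplify (1) → Accum (1) → Joiner → PolarRect (20, marked Data Parallel)]
Target a 4-core machine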
43
Data Parallelize
[Figure: the vocoder graph after data parallelization: the filters annotated 20 (RectPolar, PolarRect) are each replaced by a Splitter, replicas annotated 5, and a Joiner; the rest of the graph is unchanged]
Target a 4-core machine
44
Data + Task Parallel Execution
[Figure: execution schedule, cores versus time; labeled time: 21]
Target a 4-core machine
45
We Can Do Better
[Figure: execution schedule, cores versus time; labeled time: 16]
Target a 4-core machine
46
Coarse-Grained Software Pipelining
[Figure: software-pipelined schedule with two labeled phases: Prologue, New Steady State]
47
Evaluation: Coarse-Grained Task + Data + Software Pipelining
48
Evaluation: Coarse-Grained Task + Data + Software Pipelining
Best Parallelism! Lowest Synchronization!
49
Parallelism Take Away
  • Stream programs have abundant parallelism
    - However, parallelism is obfuscated in a language like C
  • Stream languages enable new, effective mapping
    - In C, analogous transformations are impossibly complex
    - In StreamC or Brook, similar transformations are possible [Khailany et al., IEEE Micro'01] [Buck et al., SIGGRAPH'04] [Das et al., PACT'06]
  • Results should extend to other multicores
    - Parameters: local memory, comm.-to-comp. cost
    - Preliminary results on Cell are promising [Zhang, dasCMP'07]

Coarsen Granularity → Data Parallelize → Software Pipeline
50
Part 3: Domain-Specific Optimizations
Joint work with Andrew Lamb, Sitij Agrawal
  • Andrew Lamb, William Thies, Saman Amarasinghe (PLDI'03)
  • Sitij Agrawal, William Thies, Saman Amarasinghe (CASES'05)

51
DSP Optimization Process
  • Given specification of algorithm, minimize the computation cost

[Figure: AtoD → FMDemod → Duplicate → (LPF1 → HPF1, LPF2 → HPF2, LPF3 → HPF3) → RoundRobin → Adder → Speaker, with part of the graph labeled "Linear"]
52
DSP Optimization Process
  • Given specification of algorithm, minimize the computation cost

[Figure: AtoD → FMDemod → Equalizer → Speaker]
53
DSP Optimization Process
  • Given specification of algorithm, minimize the computation cost

[Figure: AtoD → FMDemod → FFT → Equalizer → IFFT]
54
DSP Optimization Process
  • Given specification of algorithm, minimize the computation cost
    - Currently done by hand (MATLAB)
  • Can compiler replace DSP expert?
    - Library generators limited [Spiral] [FFTW] [ATLAS]
    - Enable unified development environment

[Figure: AtoD → FMDemod → FFT → Equalizer → IFFT]
55
Focus: Linear State Space Filters
  • Properties
    - Outputs are linear function of inputs and states
    - New states are linear function of inputs and states
  • Most common target of DSP optimizations
    - FIR / IIR filters
    - Linear difference equations
    - Upsamplers / downsamplers
    - DCTs

[Figure: filter with inputs $u$, states $x$, and outputs $y$]
$x' = Ax + Bu$
$y = Cx + Du$
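
As an illustration of the state space form, where new states are $x' = Ax + Bu$ and outputs are $y = Cx + Du$, the Python sketch below runs such a filter on a short input stream. The helper names and the example coefficients are assumptions chosen for this sketch: they encode the one-pole IIR filter y[n] = 0.5·y[n-1] + u[n], with the previous output as the single state.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def vadd(p, q):
    return [a + b for a, b in zip(p, q)]


def run_state_space(A, B, C, D, x, inputs):
    """Fire the filter once per input vector u, returning all outputs."""
    outputs = []
    for u in inputs:
        y = vadd(matvec(C, x), matvec(D, u))   # y  = Cx + Du
        x = vadd(matvec(A, x), matvec(B, u))   # x' = Ax + Bu
        outputs.append(y)
    return outputs


# Hypothetical example: y[n] = 0.5 * y[n-1] + u[n], state x = y[n-1].
A, B = [[0.5]], [[1.0]]
C, D = [[0.5]], [[1.0]]
print(run_state_space(A, B, C, D, [0.0], [[1.0], [1.0], [1.0]]))
# [[1.0], [1.5], [1.75]]
```

An FIR filter fits the same form with a state vector that holds the previous inputs, and a purely linear filter is the special case with no states.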
56
Focus: Linear State Space Filters
[Figure: filter with inputs $u$, states $x$, and outputs $y$]
$x' = Ax + Bu$
$y = Cx + Du$
57
Focus: Linear Filters

    float->float filter Scale {
        work push 2 pop 1 {
            float u = pop();
            push(u);
            push(2*u);
        }
    }

Linear dataflow analysis

[Figure: filter with inputs $u$ and outputs $y$]
$y = Du$
58
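Focus: Linear Filters

    float->float filter Scale {
        work push 2 pop 1 {
            float u = pop();
            push(u);
            push(2*u);
        }
    }

Linear dataflow analysis

[Figure: filter with inputs $u$ and outputs $y$]
$\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} u$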
59
Combining Adjacent Filters
[Figure: $u$ → Filter 1 ($y = Du$) → $y$ → Filter 2 ($z = Ey$) → $z$]
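
Substituting $y = Du$ into $z = Ey$ gives $z = (ED)u$, so two adjacent linear filters collapse into one whose matrix is the product $ED$. The Python sketch below checks this on a small case, assuming the rates match (Filter 2 pops exactly what Filter 1 pushes). `D` is the Scale filter (push(u); push(2u)), while `E` and the helper names are hypothetical values chosen for illustration.

```python
def matmul(a, b):
    """Multiply matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def apply(m, u):
    """Fire a linear filter once: multiply matrix m by input vector u."""
    return [sum(c * x for c, x in zip(row, u)) for row in m]


D = [[1.0], [2.0]]   # Filter 1 (Scale): pop 1, push 2; y1 = u, y2 = 2u
E = [[3.0, 4.0]]     # Filter 2 (hypothetical): pop 2, push 1; z = 3*y1 + 4*y2

ED = matmul(E, D)    # combined filter: pop 1, push 1
u = [5.0]

print(ED)                        # [[11.0]]
print(apply(E, apply(D, u)))     # [55.0]
print(apply(ED, u))              # [55.0]
```

In this toy case the combined filter needs one multiplication per firing, where the original pair needed several multiplications and an addition.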
60
Combination Example
[Figure: Filter 1 followed by Filter 2 ($u$ → $y$ → $z$), shown beside the equivalent Combined Filter ($u$ → $z$)]
61
The General Case
  • What if matrix dimensions mismatch?

[Figure: matrix expansion, comparing the Original pipeline (U, D, E) with the Expanded pipeline, in which D appears several times before E]
62
The General Case
  • What if matrix dimensions mismatch?

[Figure: matrix expansion, comparing the Original pipeline (U, D, E) with the Expanded pipeline, in which D appears several times before E]
63
The General Case
Pipelines
Feedback Loops
64
The General Case
Splitjoins
65
Floating-Point Operations Reduction
0.3
66
Floating-Point Operations Reduction
0.3
-140
67
Radar (Transformation Selection)
68
Radar (Transformation Selection)
69
Radar (Transformation Selection)
70
Radar (Transformation Selection)
Using Transformation Selection
71
Floating Point Operations Reduction
0.3
-140
72
Execution Speedup
5
On a Pentium IV
73
Execution Speedup
  • Additional transformations
    1. Eliminating redundant states
    2. Eliminating parameters (non-zero, non-unary coefficients)
    3. Translation to the compressed domain
5
On a Pentium IV
74
StreamIt Lessons Learned
  • In practice, I/O rates of filters are often matched [LCTES'03]
    - Over 30 publications study an uncommon case (CD-DAT)
  • Multi-phase filters complicate programs, compilers
    - Should maintain simplicity of only one atomic step per filter
  • Programmers accidentally introduce mutable filter state

[Figure: CD-DAT pipeline of four filters with I/O rates 1→2, 3→2, 7→8, 7→5 and multiplicities ×147, ×98, ×28, ×32]

Stateful:

    void->int filter SquareWave() {
        int x = 0;
        work push 1 {
            push(x);
            x = 1 - x;
        }
    }

Stateless:

    void->int filter SquareWave() {
        work push 2 {
            push(0);
            push(1);
        }
    }
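
The two SquareWave variants produce the same stream, which a short Python model can confirm. The class and function names here are hypothetical stand-ins for the StreamIt filters: one call to `work` models one firing, and the returned list models the items pushed.

```python
class StatefulSquareWave:
    """Models the stateful version: pushes 1 item and flips x per firing."""

    def __init__(self):
        self.x = 0

    def work(self):
        out = [self.x]
        self.x = 1 - self.x
        return out


def stateless_square_wave_work():
    """Models the stateless version: pushes 0 then 1 on every firing."""
    return [0, 1]


def run(work, firings):
    """Collect everything pushed over the given number of firings."""
    out = []
    for _ in range(firings):
        out.extend(work())
    return out


stateful = run(StatefulSquareWave().work, 8)
stateless = run(stateless_square_wave_work, 4)
print(stateful)               # [0, 1, 0, 1, 0, 1, 0, 1]
print(stateful == stateless)  # True
```

The stateless form carries no value from one firing to the next, so its firings are independent of each other.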
75
Future of StreamIt
  • Goal: influence the next big language

[Figure: "Origins of C++", a timeline from 1960 to 1990 with a legend of structural influence, feature influence, and academic origin. Languages shown: Fortran, Algol 60, CPL, Simula 67, BCPL, C, ML, Algol 68, Clu, C with Classes, Ada, C++, ANSI C, C++arm, C++std. Source: B. Stroustrup, The Design and Evolution of C++]
76
Research Trajectory
  • Vision: Make emerging computational substrates universally accessible and useful
  • 1. Languages, compilers, tools for multicores
    - I believe new language / compiler technology can enable scalable and robust performance
    - Next inroads: expose and exploit flexibility in programs
  • 2. Programmable microfluidics
    - We have developed programming languages, tools, and flexible new devices for microfluidics
    - Potential to revolutionize biology experimentation
  • 3. Technologies for the developing world
    - TEK: enable Internet experience over email account
    - Audio Wiki: publish content from a low-cost phone
    - uBox / uPhone: monitor and improve rural healthcare

77
Conclusions
  • A parallel programming model will succeed only by luring programmers, making them do less, not more
  • Stream programming lures programmers with
    - Elegant programming primitives
    - Domain-specific optimizations
  • Meanwhile, streaming is implicitly parallel
    - Robust performance via task, data, pipeline parallelism
  • We believe stream programming will play a key role in enabling a transition to multicore processors
  • Contributions
    - Structured streams
    - Teleport messaging
    - Unified algorithm for task, data, pipeline parallelism
    - Software pipelining of whole procedures
    - Algebraic simplification of whole procedures
    - Translation from time to frequency
    - Selection of best DSP transforms

78
Acknowledgments
  • Project supervisors
    - Prof. Saman Amarasinghe
    - Dr. Rodric Rabbah
  • Contributors to this talk
    - Michael I. Gordon (Ph.D. Candidate): leads StreamIt backend efforts
    - Andrew A. Lamb (M.Eng): led linear optimizations
    - Sitij Agrawal (M.Eng): led statespace optimizations
  • Compiler developers
    - Kunal Agrawal, Allyn Dimock, Qiuyuan Jimmy Li, Jasper Lin, Michal Karczmarek, David Maze, Janis Sermulins, Phil Sung, David Zhang
  • Application developers
    - Basier Aziz, Matthew Brown, Matthew Drake, Shirley Fung, Hank Hoffmann, Chris Leger, Ali Meli, Satish Ramaswamy, Jeremy Wong
  • User interface developers
    - Kimberly Kuo, Juan Reyes