Title: Orchestration by Approximation: Mapping Stream Programs onto Multicore Architectures
1. Orchestration by Approximation: Mapping Stream Programs onto Multicore Architectures
- S. M. Farhad (University of Sydney)
- Joint work with
- Yousun Ko
- Bernd Burgstaller
- Bernhard Scholz
2. Outline
- Motivation
- Multicore
- Stream programming
- Research question
- Our work
- Contributions
- Overview
- Details
- Summary
3. Cores are the New Gates (Shekhar Borkar, Intel)
[Figure, courtesy Scott 08: programming models for multicore, axis "cores/chip". Labels: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, Stream Programming]
4. Stream Programming Paradigm
- Research topic in parallel programming
- Various forms of parallelism
- Pipeline, task, and data
- Applications
- Signal Processing
- Multi-media
- High-Performance Computing
- Programs expressed as stream graphs
- Streams
- Infinite sequence of data elements (a.k.a. tokens)
- Actor
- Functions applied to streams
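To make the stream and actor vocabulary concrete, here is a minimal sketch that models a stream as a Python iterator and an actor as a generator function consuming one stream and producing another. The names `source`, `scale`, and `pairwise_sum` are hypothetical and are not taken from StreamIt or from the talk.

```python
from itertools import count, islice


def source():
    """Stream: an unbounded sequence of tokens 0, 1, 2, ..."""
    return count()


def scale(stream, factor):
    """Actor that consumes one token and produces one token per firing."""
    for token in stream:
        yield token * factor


def pairwise_sum(stream):
    """Actor that consumes two tokens and produces one token per firing."""
    while True:
        a = next(stream)
        b = next(stream)
        yield a + b


# Pipeline parallelism in miniature: source -> scale -> pairwise_sum.
pipeline = pairwise_sum(scale(source(), 10))
first_three = list(islice(pipeline, 3))
print(first_three)  # [10, 50, 90]
```

Because every actor only touches its input and output streams, the communication between actors is explicit, which is the property a compiler relies on when it maps actors to cores.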
5. Properties of Stream Programs
- Regular and repeating computation
- Independent actors with explicit communication
- Producer / Consumer dependencies
[Figure: FM radio stream graph with actors AtoD, FMDemod, Splitter, LPF1 to LPF3, HPF1 to HPF3, Joiner, Adder, Speaker]
6. StreamIt Language ASPLOS26, PLDI3
- An implementation of stream programming
- Model of computation
- Synchronous data flow model
- Program is a graph of independent actors and streams
- Actors have an execution step with known input/output rates
- Compiler schedules and manages buffers
[Figure: synchronous data flow graph of actors A, B, C with edge rates 1, 5, 1, 1 and repetition counts x5, x1, x1]
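Known input/output rates are what let a compiler derive a steady-state schedule. The sketch below solves the balance equations of a synchronous data flow graph, `reps[src] * produce == reps[dst] * consume` for every edge. The chain A -> B -> C with rates (1, 5) and (1, 1) is an assumed reading of the slide's figure, and `repetition_vector` is a hypothetical helper, not part of the StreamIt compiler.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Smallest positive integer firing counts that balance every edge.

    edges is a list of (src, dst, produce, consume) tuples. The graph is
    assumed to be connected (an assumption of this sketch).
    """
    reps = {actors[0]: Fraction(1)}
    changed = True
    while changed:  # propagate relative firing rates along the edges
        changed = False
        for src, dst, produce, consume in edges:
            if src in reps and dst not in reps:
                reps[dst] = reps[src] * produce / consume
                changed = True
            elif dst in reps and src not in reps:
                reps[src] = reps[dst] * consume / produce
                changed = True
    for src, dst, produce, consume in edges:  # consistency check
        if reps[src] * produce != reps[dst] * consume:
            raise ValueError("rates are inconsistent: no steady state")
    # Scale the rational solution to the smallest integer solution.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in reps.values()), 1)
    counts = {a: int(r * scale) for a, r in reps.items()}
    common = reduce(gcd, counts.values())
    return {a: n // common for a, n in counts.items()}


schedule = repetition_vector(["A", "B", "C"],
                             [("A", "B", 1, 5), ("B", "C", 1, 1)])
print(schedule)  # {'A': 5, 'B': 1, 'C': 1}
```

Under these assumed rates, A must fire five times for every firing of B and C, which matches the x5, x1, x1 annotations that survive from the figure.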
7. StreamIt Language Contd.
- Each construct has single input/output stream
- Hierarchical structure
- Filters can be stateful/stateless
[Figure: StreamIt constructs: filter; pipeline (each stage may be any StreamIt language construct); splitjoin (parallel computation between a splitter and a joiner); feedback loop (joiner and splitter)]
8. Research Question: How to Orchestrate a Stream Graph?
- Mapping actors
- Eliminate bottlenecks (a.k.a. hot actors)
9. We Focus on?
- Algorithm for actor allocation
[Figure: actors A (5), B (60), C (60), D (5) mapped onto Core 1, Core 2, Core 3]
Makespan = 60, Speedup = 130/60 = 2.17
10. We Focus on?
Makespan = 47, Speedup = 130/47 = 2.77
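The makespan and speedup figures on these two slides follow from simple arithmetic, sketched below. The per-actor work values come from the slide; the concrete mapping is an assumption, since the figure itself did not survive, and `makespan` and `speedup` are hypothetical helper names.

```python
work = {"A": 5, "B": 60, "C": 60, "D": 5}  # total work = 130


def makespan(mapping, work):
    """Load of the busiest core under a given actor-to-core mapping."""
    return max(sum(work[a] for a in actors) for actors in mapping.values())


def speedup(work, span):
    """Single-core execution time divided by the parallel makespan."""
    return sum(work.values()) / span


# Assumed mapping: the two heavy actors each get their own core.
mapping = {"Core 1": ["A", "D"], "Core 2": ["B"], "Core 3": ["C"]}
span = makespan(mapping, work)

print(span, round(speedup(work, span), 2))  # 60 2.17
print(round(speedup(work, 47), 2))          # 2.77
```

No mapping of the four unsplit actors can beat a makespan of 60, because B and C each need 60 on their own. A lower makespan such as 47 therefore requires changing the graph itself, which is what bottleneck elimination does.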
11. Orchestration of Stream Program Contd.
- Current state of the art
- Integer Linear Programming
- Intractable
- Heuristics
- Unknown performance
- How to find a fast and good solution?
- Approximation algorithms that have
- Polynomial runtime
- Quality bound for solution
12. Our Contribution
- A simple quantitative analysis to detect and eliminate bottlenecks
- A novel 2-approximation algorithm for deploying stream graphs on multicore platforms
- Experiments and comparison
- Results are within 5% of the optimal solution
- Achieves a geometric mean speedup of 6.95x for 8 processors over single-processor execution
13. Focus of Our Work
StreamIt Compiler Phases
Stream Graph Scheduling
Linear Functional Equation Solver
Stream Graph Partitioning
Bottleneck Resolver
Layout on Target Architecture
Actor Allocation on Processors
Communication Scheduling
14. Our Data Transfer Model
- Arrival rate (the quantity to maximize) depends on the data rates of the actors
- Data transfer model forms a system of simultaneous linear functional equations
- Compute a closed form of the output data rate
- We also consider a processor utilization function for each actor
15. Bottleneck Analysis
- The arrival rate is limited by
- Processor capacity of the cores
- Memory bandwidth
- A quantitative analysis determines
- An upper bound of the arrival rate imposed by an actor
- An upper bound of the arrival rate imposed by the parallel system
- Hot actor
- Upper bound (actor) < upper bound (system)
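The hot-actor test above can be illustrated with a small sketch. It assumes that an actor's processor utilization grows linearly with the arrival rate (utilization = work per token x rate) and it ignores memory bandwidth. Both are simplifying assumptions of this example, the talk's actual bound formulas are not shown on the slide, and `hot_actors` is a hypothetical name.

```python
def hot_actors(work, num_cores):
    """Actors whose own rate bound is below the system-wide rate bound.

    Assumed model: an actor with per-token work w saturates one core at
    rate 1/w, and num_cores cores saturate at rate num_cores / total work.
    """
    system_bound = num_cores / sum(work.values())
    actor_bound = {actor: 1.0 / w for actor, w in work.items()}
    hot = sorted(a for a, bound in actor_bound.items() if bound < system_bound)
    return hot, system_bound


work = {"A": 5, "B": 60, "C": 60, "D": 5}
hot, system_bound = hot_actors(work, num_cores=3)
print(hot)  # ['B', 'C']
```

With three cores the system could sustain a rate of 3/130 per time unit, but B and C each cap the rate at 1/60, so under this model they are the actors that need to be resolved before allocation.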
16. Resolving Bottlenecks
17. Actor Allocation Problem
- The actor allocation problem (AAP) is NP-hard
- For a fixed arrival rate, the AAP reduces to the standard bin-packing problem (closed form)
- There exist approximation algorithms for bin-packing
- Polynomial running time
- Solution quality is bounded
18. Actor Allocation Constraint
[Figure: actors with their utilizations packed into cores; each core has 100% utilization capacity]
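For a fixed arrival rate, the allocation constraint is a bin-packing instance: actors are items sized by their utilization and each core is a bin of capacity 100%. The sketch below uses first-fit decreasing, a well-known bin-packing heuristic chosen here for illustration; it is not claimed to be the 2-approximation algorithm of the talk. The utilization values are hypothetical.

```python
def first_fit_decreasing(utilization, num_cores, capacity=100):
    """Pack actors (utilization in percent) onto cores of the given capacity.

    Returns a list of actor lists, one per core, or None if some actor
    does not fit on any core.
    """
    loads = [0] * num_cores
    cores = [[] for _ in range(num_cores)]
    # Place the largest actors first; each goes to the first core with room.
    for actor in sorted(utilization, key=utilization.get, reverse=True):
        for i in range(num_cores):
            if loads[i] + utilization[actor] <= capacity:
                loads[i] += utilization[actor]
                cores[i].append(actor)
                break
        else:
            return None  # no core can take this actor
    return cores


utilization = {"A": 10, "B": 60, "C": 60, "D": 10, "E": 40}  # hypothetical
allocation = first_fit_decreasing(utilization, num_cores=3)
print(allocation)  # [['B', 'E'], ['C', 'A', 'D'], []]
print(first_fit_decreasing(utilization, num_cores=1))  # None
```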
19-21. Binary Search
[Figure, three animation steps: binary search interval with axis marks 0, ub(z), 1.0]
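The binary search can be sketched as follows: pick a candidate arrival rate, check whether the resulting utilizations can be packed onto the cores, and halve the interval accordingly. The feasibility check is a first-fit decreasing packing, which is an assumption of this example rather than the talk's algorithm, and the per-token costs are hypothetical. A heuristic check is not guaranteed to be monotone in the rate in general, so this is an illustration only.

```python
def fits(costs, rate, num_cores):
    """Greedy check: do utilizations cost * rate fit on cores of capacity 1.0?"""
    loads = [0.0] * num_cores
    for u in sorted((c * rate for c in costs), reverse=True):
        for i in range(num_cores):
            if loads[i] + u <= 1.0:
                loads[i] += u
                break
        else:
            return False
    return True


def max_rate(costs, num_cores, upper_bound, tol=1e-9):
    """Binary search for the largest rate in [0, upper_bound] that still fits."""
    lo, hi = 0.0, upper_bound
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fits(costs, mid, num_cores):
            lo = mid  # feasible: try a higher rate
        else:
            hi = mid  # infeasible: lower the rate
    return lo


costs = [20, 20, 20, 20, 20, 20, 5, 5]  # hypothetical work per token
rate = max_rate(costs, num_cores=3, upper_bound=3 / sum(costs))
print(round(1 / rate))  # 45
```

Here the search converges to a rate of 1/45, meaning the busiest core needs 45 time units per token; the system-wide bound 3/130 is used as the initial upper end of the interval.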
22. Actor Allocation of Bottleneck Free Program
[Figure: mapping after efficient bottleneck resolving]
Makespan = 45, Speedup = 130/45 = 2.89
23. Experiments
- Our method is implemented as an extension of the StreamIt compiler
- We compare to the ILP-based method of Scott 08 (solved with CPLEX)
- Hardware setup
- 2.33 GHz dual quad-core Intel Xeon processors
- 16 GB memory
- Linux kernel version 2.6.23
- Profiler uses the x86-64 hardware cycle counters
24. Experiments Contd.
- Experimental Process
- Profiling
- Computing closed form
- Resolve bottlenecks
- Compute the mapping
- Compute the layout scheduling
- Invoke the StreamIt back end
- Finally we measure the performance
25. Experiments Contd.
Benchmark          Actors   Stateful
DCT                22       18
FMRadio            67       23
TDE                55       27
FFT                26       14
MergeSort          31       2
FilterBankNew      53       34
RadixSort          13       2
EqualizerProgram   65       22
BitonicSort        452      2
DES                375      180
MPEG               39       7
MatrixMult         52       2
26. Experimental Results for 2-4 Processors
Benchmark     Proc   ILP time (optimal) (s)
DCT           2      0.27
DCT           3      1585.69
DCT           4      2285.01
FMRadio       2      0.08
FMRadio       3      3.22
FMRadio       4      1.29
TDE           2      0.09
TDE           3      0.17
TDE           4      274.69
FFT           2      0.46
FFT           3      44694.25
FFT           4      249240.09
Equalizer     2      0.06
Equalizer     3      5.29
Equalizer     4      57553.83
BitonicSort   2      0.3
BitonicSort   3      3.06
BitonicSort   4      16371.99
DES           2      0.51
DES           3      2.73
DES           4      11.24
MPEG          2      0.12
MPEG          3      1.37
MPEG          4      0.44
Our method's run time: < 1 s
27. Experimental Results for 2-4 Processors
Benchmark     Proc   Arrival rate ratio (Apx./Optimal)   Apx. arrival rate (MBps)
DCT           2      0.99                                42.56
DCT           3      0.97                                62.46
DCT           4      0.96                                82.57
FMRadio       2      1.00                                2.69
FMRadio       3      1.00                                4.00
FMRadio       4      0.99                                5.34
TDE           2      1.00                                13.31
TDE           3      1.00                                19.80
TDE           4      1.00                                28.81
FFT           2      0.98                                43.91
FFT           3      0.98                                46.85
FFT           4      0.95                                95.50
Equalizer     2      1.00                                0.56
Equalizer     3      1.00                                0.83
Equalizer     4      1.00                                1.59
BitonicSort   2      1.00                                3.16
BitonicSort   3      1.00                                4.73
BitonicSort   4      1.00                                10.14
DES           2      1.00                                0.12
DES           3      1.00                                0.18
DES           4      1.00                                0.24
MPEG          2      1.00                                36.59
MPEG          3      0.99                                54.68
MPEG          4      1.00                                73.22
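As a quick cross-check of the ratio column, the sketch below finds the worst approximation-to-optimal ratio in the table and converts it to a percentage gap. The ratios are transcribed from the slide (for 2, 3, and 4 processors, in that order); the variable names are only illustrative.

```python
ratios = {
    "DCT": [0.99, 0.97, 0.96],
    "FMRadio": [1.00, 1.00, 0.99],
    "TDE": [1.00, 1.00, 1.00],
    "FFT": [0.98, 0.98, 0.95],
    "Equalizer": [1.00, 1.00, 1.00],
    "BitonicSort": [1.00, 1.00, 1.00],
    "DES": [1.00, 1.00, 1.00],
    "MPEG": [1.00, 0.99, 1.00],
}

# Worst case over all benchmarks and processor counts.
worst_benchmark, worst_ratio = min(
    ((name, min(values)) for name, values in ratios.items()),
    key=lambda pair: pair[1],
)
gap_percent = round((1 - worst_ratio) * 100, 2)
print(worst_benchmark, worst_ratio, gap_percent)  # FFT 0.95 5.0
```

The worst entry is FFT on 4 processors at 0.95, which is consistent with the claim that the approximation stays within 5% of the optimum on these benchmarks.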
28. Speedup Results
29. Summary
- Approximation algorithm for solving the actor allocation problem
- Data rate transfer model that resolves bottlenecks
- We separate the bottleneck elimination from the actor allocation
- We implemented our approach and compared it with an optimal approach
- Optimal approach has unpredictable running time
- Our approach has negligible running time for all benchmarks
- Quality of our approach is at most 5% off the optimum
- For up to 8 processors we achieve a geometric mean speedup of 6.95x over single-processor execution
30. Related Works
- [1] Static Scheduling of SDF Programs for DSP (Lee 87)
- [2] StreamIt: A language for streaming applications (Thies 02)
- [3] Phased Scheduling of Stream Programs (Thies 03)
- [4] Exploiting Coarse Grained Task, Data, and Pipeline Parallelism in Stream Programs (Thies 06)
- [5] Orchestrating the Execution of Stream Programs on Cell (Scott 08)
- [6] Software Pipelined Execution of Stream Programs on GPUs (Udupa 09)
- [7] Synergistic Execution of Stream Programs on Multicores with Accelerators (Udupa 09)
31. Questions?