Orchestration by Approximation: Mapping Stream Programs onto Multicore Architectures

1
Orchestration by Approximation: Mapping Stream Programs onto Multicore Architectures

S. M. Farhad (University of Sydney)
Joint work with
Yousun Ko
Bernd Burgstaller
Bernhard Scholz

2
Outline

Motivation
Multicore
Stream programming
Research question
Our work
Contributions
Overview
Details
Summary

3
"Cores are the New Gates" (Shekhar Borkar, Intel)
[Figure: cores/chip chart annotated with parallel programming models: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, and Stream Programming. Courtesy Scott08.]
4
Stream Programming Paradigm

Research topic in parallel programming
Various forms of parallelism: pipeline, task, and data
Applications: signal processing, multimedia, high-performance computing
Programs expressed as stream graphs
Streams: infinite sequences of data elements (a.k.a. tokens)
Actors: functions applied to streams

5
Properties of Stream Programs

Regular and repeating computation
Independent actors with explicit communication
Producer/consumer dependencies

[Figure: example stream graph with actors AtoD, FMDemod, Splitter, LPF1-LPF3, HPF1-HPF3, Joiner, Adder, Speaker]
6
StreamIt Language ASPLOS26, PLDI3

An implementation of stream programming
Model of computation: synchronous data flow
Program is a graph of independent actors and streams
Actors have an execution step with known input/output rates
Compiler schedules and manages buffers
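Because every actor has known input/output rates, a compiler can solve the balance equations of the synchronous data flow graph to obtain a steady-state schedule. The following is a minimal sketch of that idea, not the StreamIt compiler's implementation. The function name `repetition_vector` and the edge format `(producer, consumer, push_rate, pop_rate)` are assumptions made for illustration, and the example chain A, B, C uses the rates 1, 5, 1, 1 that appear as labels on this slide.

```python
from fractions import Fraction
from math import lcm  # requires Python 3.9+


def repetition_vector(actors, edges):
    """Solve the SDF balance equations of a connected graph.

    edges: list of (producer, consumer, push_rate, pop_rate) tuples
    (hypothetical format). Returns the smallest positive integer firing
    counts, or raises ValueError if the rates are inconsistent.
    """
    firings = {actors[0]: Fraction(1)}  # fix one actor, derive the rest
    changed = True
    while changed:
        changed = False
        for src, dst, push, pop in edges:
            if src in firings and dst not in firings:
                firings[dst] = firings[src] * push / pop
                changed = True
            elif dst in firings and src not in firings:
                firings[src] = firings[dst] * pop / push
                changed = True
    # Every edge must be balanced: tokens produced == tokens consumed.
    for src, dst, push, pop in edges:
        if firings[src] * push != firings[dst] * pop:
            raise ValueError("inconsistent rates: no steady-state schedule")
    scale = lcm(*(f.denominator for f in firings.values()))
    return {actor: int(f * scale) for actor, f in firings.items()}


# A pushes 1 token, B pops 5 and pushes 1, C pops 1.
edges = [("A", "B", 1, 5), ("B", "C", 1, 1)]
schedule = repetition_vector(["A", "B", "C"], edges)
print(schedule)  # {'A': 5, 'B': 1, 'C': 1}
```

In this sketch A must fire five times for every firing of B and C, which is also how much buffer space the A-to-B stream needs in one steady-state iteration.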

[Figure: synchronous data flow example with actors A, B, C; edge rates 1, 5, 1, 1; firings A x5, B x1, C x1]
7
StreamIt Language Contd.

Each construct has a single input stream and a single output stream
Hierarchical structure
Filters can be stateful or stateless

[Figure: StreamIt constructs: filter, pipeline, splitjoin (splitter, parallel computation, joiner), feedback loop (joiner, splitter); each nested stage may be any StreamIt language construct]
8
Research Question: How to Orchestrate a Stream Graph?

Mapping actors
Eliminate bottlenecks (a.k.a. hot actors)
9
We Focus On

Algorithm for actor allocation

[Figure: actors A (5), B (60), C (60), D (5) allocated to Core 1, Core 2, Core 3]
Makespan = 60, Speedup = 130/60 = 2.17
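The makespan and speedup figures on this slide can be reproduced with a few lines of arithmetic. This is a minimal sketch: the work values 5, 60, 60, 5 come from the slide, while the concrete allocation of actors to cores and the function name `makespan_and_speedup` are assumptions for illustration (any allocation that keeps B and C on different cores gives the same makespan).

```python
def makespan_and_speedup(work, allocation):
    """Return (makespan, speedup) for an actor-to-core allocation.

    work: actor -> execution time; allocation: core -> list of actors.
    Speedup is measured against running all actors on a single core.
    """
    core_loads = {
        core: sum(work[actor] for actor in actors)
        for core, actors in allocation.items()
    }
    makespan = max(core_loads.values())  # the busiest core sets the pace
    speedup = sum(work.values()) / makespan
    return makespan, speedup


work = {"A": 5, "B": 60, "C": 60, "D": 5}
# Assumed allocation; the slide's figure is not recoverable from the text.
allocation = {"Core 1": ["A", "D"], "Core 2": ["B"], "Core 3": ["C"]}
makespan, speedup = makespan_and_speedup(work, allocation)
print(makespan, round(speedup, 2))  # 60 2.17
```

The ideal makespan on three cores would be 130/3, about 43.3, so the heavy actors B and C are what keep the speedup at 2.17.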
10
We Focus On

Bottleneck elimination

Makespan = 47, Speedup = 130/47 = 2.77
11
Orchestration of Stream Program Contd.

Current state of the art
Integer Linear Programming: intractable
Heuristics: unknown performance
How to find a fast and good solution?
Approximation algorithms that have polynomial runtime and a quality bound for the solution
12
Our Contribution

A simple quantitative analysis to detect and eliminate bottlenecks
A novel 2-approximation algorithm for deploying stream graphs on multicore platforms
Experiments and comparison
Results are within 5% of the optimal solution
Achieves a geometric mean speedup of 6.95x for 8 processors over single-processor execution
13
Focus of Our Work
StreamIt Compiler Phases
Stream Graph Scheduling
Linear Functional Equation Solver
Stream Graph Partitioning
Bottleneck Resolver
Layout on Target Architecture
Actor Allocation on Processors
Communication Scheduling
14
Our Data Transfer Model

Arrival rate depends on the data rate of the actors (to be maximized)
Data transfer model forms a system of simultaneous linear functional equations
Compute a closed form of the output data rate
We also consider a processor utilization function for each actor
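To make the idea of a closed-form data rate concrete, here is a minimal sketch for the simplest case, a linear pipeline. It is not the authors' solver: the model (each actor scales its input rate by push/pop, and utilization grows linearly with the input rate), the per-token costs, and the name `propagate_rates` are all assumptions for illustration.

```python
def propagate_rates(arrival_rate, pipeline):
    """Propagate an arrival rate through a pipeline of actors.

    pipeline: list of (name, pop, push, cost_per_token) tuples in stream
    order (hypothetical format). Returns per-actor (name, input_rate,
    utilization) entries and the output data rate of the last actor.
    """
    rate = arrival_rate
    report = []
    for name, pop, push, cost_per_token in pipeline:
        utilization = rate * cost_per_token  # assumed linear in the rate
        report.append((name, rate, utilization))
        rate = rate * push / pop  # tokens out per tokens in
    return report, rate


# Hypothetical costs; rates follow the A, B, C example (1, 5, 1, 1).
pipeline = [("A", 1, 1, 0.05), ("B", 5, 1, 0.10), ("C", 1, 1, 0.20)]
report, output_rate = propagate_rates(2.0, pipeline)
for name, input_rate, utilization in report:
    print(f"{name}: input rate {input_rate:.2f}, utilization {utilization:.2f}")
print(f"output rate {output_rate:.2f}")
```

Under these assumptions the output rate is the arrival rate times the product of the push/pop ratios, so every actor's utilization is a linear function of the arrival rate alone.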

15
Bottleneck Analysis

The arrival rate is limited by the processor capacity of the cores and by memory bandwidth
A quantitative analysis determines
An upper bound of the arrival rate imposed by an actor
An upper bound of the arrival rate imposed by the parallel system
Hot actor: upper bound (actor) < upper bound (system)
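The hot-actor test above can be sketched as follows. This is an illustration under simplifying assumptions, not the authors' analysis: utilization is taken to be linear in the arrival rate, an actor cannot use more than one core, and memory bandwidth is ignored. The function name `find_hot_actors` and the use of the work values 5, 60, 60, 5 as per-token costs are assumptions.

```python
def find_hot_actors(cost_per_token, num_cores):
    """Return (system_bound, actor_bounds, hot_actors).

    cost_per_token: actor -> core time needed per arriving token
    (assumed linear utilization model; memory bandwidth is ignored).
    """
    # All cores together cannot do more than num_cores units of work.
    system_bound = num_cores / sum(cost_per_token.values())
    # One actor cannot use more than a single core.
    actor_bounds = {a: 1.0 / c for a, c in cost_per_token.items()}
    hot = sorted(a for a, bound in actor_bounds.items() if bound < system_bound)
    return system_bound, actor_bounds, hot


costs = {"A": 5, "B": 60, "C": 60, "D": 5}
system_bound, actor_bounds, hot = find_hot_actors(costs, num_cores=3)
print(hot)  # ['B', 'C']
```

With three cores the system could sustain 3/130 tokens per time unit, but B and C each cap the rate at 1/60, so both are hot and are the candidates for bottleneck resolution.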

16
Resolving Bottlenecks
17
Actor Allocation Problem

The actor allocation problem (AAP) is NP-hard
For a fixed arrival rate, the AAP reduces to the standard bin-packing problem (closed form)
There exist approximation algorithms for bin-packing with polynomial running time and bounded solution quality
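As an illustration of the bin-packing view, the sketch below packs actors onto cores with first-fit decreasing, a well-known polynomial-time heuristic with a bounded approximation ratio. It is a stand-in chosen for illustration; the talk does not spell out which packing algorithm the authors use. The utilization values (in percent of one core) and the function name are hypothetical.

```python
def first_fit_decreasing(utilizations, capacity=100):
    """Pack actors onto cores with first-fit decreasing.

    utilizations: actor -> utilization in percent at a fixed arrival rate.
    Returns a list of cores, each a dict with 'load' and 'actors'.
    """
    cores = []
    # Place the most demanding actors first.
    for actor, load in sorted(utilizations.items(), key=lambda kv: kv[1], reverse=True):
        if load > capacity:
            raise ValueError(f"actor {actor} does not fit on any core")
        for core in cores:
            if core["load"] + load <= capacity:
                core["load"] += load
                core["actors"].append(actor)
                break
        else:
            # No existing core has room: open a new one.
            cores.append({"load": load, "actors": [actor]})
    return cores


# Hypothetical utilizations in percent of one core.
utilizations = {"A": 10, "B": 55, "C": 45, "D": 30, "E": 40}
cores = first_fit_decreasing(utilizations)
for index, core in enumerate(cores, start=1):
    print(f"Core {index}: {core['actors']} ({core['load']}%)")
```

If the number of cores returned is no larger than the number of cores available, the fixed arrival rate is feasible under this heuristic.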

18
Actor Allocation Constraint
Actors with their utilizations
Each core has 100% utilization
19
Binary Search
[Figure: binary search over the arrival rate; axis labels 0, ub(z), 1.0]
20
Binary Search
[Figure: binary search over the arrival rate; axis labels 0, ub(z), 1.0]
21
Binary Search
[Figure: binary search over the arrival rate; axis labels 0, ub(z), 1.0]
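The binary search on these slides can be sketched as a search for the largest arrival rate at which all actors still fit on the available cores. This is a minimal illustration, not the authors' algorithm: it assumes utilization is linear in the arrival rate, uses first-fit decreasing as the feasibility check, and takes the system-wide bound as the upper end of the search interval. The names `fits` and `max_arrival_rate` and the per-token costs are hypothetical.

```python
def fits(costs, rate, num_cores):
    """First-fit decreasing check: do all actors fit at this arrival rate?"""
    loads = [0.0] * num_cores
    for cost in sorted(costs, reverse=True):
        utilization = cost * rate  # assumed linear utilization model
        for i in range(num_cores):
            if loads[i] + utilization <= 1.0:
                loads[i] += utilization
                break
        else:
            return False  # no core has room for this actor
    return True


def max_arrival_rate(costs, num_cores, upper_bound, tol=1e-6):
    """Binary search in [0, upper_bound] for a feasible arrival rate."""
    lo, hi = 0.0, upper_bound
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if fits(costs, mid, num_cores):
            lo = mid  # feasible: try a higher rate
        else:
            hi = mid  # infeasible: back off
    return lo  # 0.0 or a rate that passed the feasibility check


costs = [0.05, 0.60, 0.60, 0.05]  # hypothetical core time per token
upper_bound = 3 / sum(costs)      # bound imposed by the whole system
best_rate = max_arrival_rate(costs, num_cores=3, upper_bound=upper_bound)
print(f"{best_rate:.4f}")
```

Each step costs one polynomial-time packing run, and the number of steps grows only logarithmically with the requested precision. In this example the two heavy actors cap the rate near 1/0.6, below the system-wide bound, which is the situation bottleneck resolution is meant to remove.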
22
Actor Allocation of Bottleneck-Free Program
Mapping
Makespan = 45, Speedup = 130/45 = 2.89
Efficient Bottleneck Resolving
23
Experiments

Our method is implemented as an extension of the StreamIt compiler
We compare to the ILP-based method of Scott08 (solved with CPLEX)
Hardware setup
2.33 GHz dual quad-core Intel Xeon processors
16 GB memory
Linux kernel version 2.6.23
Profiler uses the x86-64 hardware cycle counters
24
Experiments Contd.

Experimental Process
Profiling
Computing closed form
Resolve bottlenecks
Compute the mapping
Compute the layout scheduling
Invoke the StreamIt back end
Finally we measure the performance

25
Experiments Contd.

Benchmark          Actors   Stateful
DCT                    22         18
FMRadio                67         23
TDE                    55         27
FFT                    26         14
MergeSort              31          2
FilterBankNew          53         34
RadixSort              13          2
EqualizerProgram       65         22
BitonicSort           452          2
DES                   375        180
MPEG                   39          7
MatrixMult             52          2
26
Experimental Results for 2-4 Processors

Benchmark     Proc   ILP Time (Optimal) (s)
DCT              2                     0.27
DCT              3                  1585.69
DCT              4                  2285.01
FMRadio          2                     0.08
FMRadio          3                     3.22
FMRadio          4                     1.29
TDE              2                     0.09
TDE              3                     0.17
TDE              4                   274.69
FFT              2                     0.46
FFT              3                 44694.25
FFT              4                249240.09
Equalizer        2                     0.06
Equalizer        3                     5.29
Equalizer        4                 57553.83
BitonicSort      2                     0.3
BitonicSort      3                     3.06
BitonicSort      4                 16371.99
DES              2                     0.51
DES              3                     2.73
DES              4                    11.24
MPEG             2                     0.12
MPEG             3                     1.37
MPEG             4                     0.44

Our method's run time: < 1 s
27
Experimental Results for 2-4 Processors

Benchmark     Proc   Arrival rate ratio (Approx./Optimal)   Approx. arrival rate (MBps)
Equalizer        2                                   1.00                          0.56
Equalizer        3                                   1.00                          0.83
Equalizer        4                                   1.00                          1.59
BitonicSort      2                                   1.00                          3.16
BitonicSort      3                                   1.00                          4.73
BitonicSort      4                                   1.00                         10.14
DES              2                                   1.00                          0.12
DES              3                                   1.00                          0.18
DES              4                                   1.00                          0.24
MPEG             2                                   1.00                         36.59
MPEG             3                                   0.99                         54.68
MPEG             4                                   1.00                         73.22
DCT              2                                   0.99                         42.56
DCT              3                                   0.97                         62.46
DCT              4                                   0.96                         82.57
FMRadio          2                                   1.00                          2.69
FMRadio          3                                   1.00                          4.00
FMRadio          4                                   0.99                          5.34
TDE              2                                   1.00                         13.31
TDE              3                                   1.00                         19.80
TDE              4                                   1.00                         28.81
FFT              2                                   0.98                         43.91
FFT              3                                   0.98                         46.85
FFT              4                                   0.95                         95.50
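The claim that results stay within 5% of the optimum can be checked directly against the arrival rate ratios in this table. The sketch below only restates the table's numbers; the variable names are chosen for illustration.

```python
# (benchmark, processors, approximate/optimal arrival rate ratio) from the table.
ratios = [
    ("Equalizer", 2, 1.00), ("Equalizer", 3, 1.00), ("Equalizer", 4, 1.00),
    ("BitonicSort", 2, 1.00), ("BitonicSort", 3, 1.00), ("BitonicSort", 4, 1.00),
    ("DES", 2, 1.00), ("DES", 3, 1.00), ("DES", 4, 1.00),
    ("MPEG", 2, 1.00), ("MPEG", 3, 0.99), ("MPEG", 4, 1.00),
    ("DCT", 2, 0.99), ("DCT", 3, 0.97), ("DCT", 4, 0.96),
    ("FMRadio", 2, 1.00), ("FMRadio", 3, 1.00), ("FMRadio", 4, 0.99),
    ("TDE", 2, 1.00), ("TDE", 3, 1.00), ("TDE", 4, 1.00),
    ("FFT", 2, 0.98), ("FFT", 3, 0.98), ("FFT", 4, 0.95),
]

# The worst case is the smallest ratio across all benchmarks and core counts.
worst = min(ratios, key=lambda row: row[2])
gap_percent = (1.0 - worst[2]) * 100
print(f"worst case: {worst[0]} on {worst[1]} processors, "
      f"{gap_percent:.0f}% below optimal")
# worst case: FFT on 4 processors, 5% below optimal
```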
28
Speedup Results
29
Summary

Approximation algorithm for solving the actor allocation problem
Data rate transfer model that resolves bottlenecks
We separate the bottleneck elimination from the actor allocation
We implemented our approach and compared it with an optimal approach
Optimal approach has unpredictable running time
Our approach has negligible running time for all benchmarks
Quality of our approach is at most 5% off the optimum
For up to 8 processors we achieve a geometric mean speedup of 6.95x over single-processor execution