Orchestration by Approximation: Mapping Stream Programs onto Multicore Architectures

1
Orchestration by Approximation: Mapping Stream
Programs onto Multicore Architectures
  • S. M. Farhad (University of Sydney)
  • Joint work with
  • Yousun Ko
  • Bernd Burgstaller
  • Bernhard Scholz

2
Outline
  • Motivation
  • Multicore
  • Stream programming
  • Research question
  • Our work
  • Contributions
  • Overview
  • Details
  • Summary

3
Cores are the New Gates
(Shekhar Borkar, Intel)
(Figure, courtesy Scott 08: cores/chip alongside programming models: C/C++/Java, CUDA, X10, Peakstream, Fortress, Accelerator, Ct, CTM, Rstream, Rapidmind, and stream programming)
4
Stream Programming Paradigm
  • Research topic in parallel programming
  • Various forms of parallelism: pipeline, task, and data
  • Applications: signal processing, multimedia, and
    high-performance computing
  • Programs expressed as stream graphs
  • Streams: infinite sequences of data elements (a.k.a. tokens)
  • Actors: functions applied to streams

5
Properties of Stream Programs
  • Regular and repeating computation
  • Independent actors with explicit communication
  • Producer/consumer dependencies

(Figure: example stream graph: AtoD, FMDemod, Splitter, parallel low-pass filters LPF1 to LPF3 and high-pass filters HPF1 to HPF3, Joiner, Adder, Speaker)
6
StreamIt Language (ASPLOS 02/06, PLDI 03)
  • An implementation of stream programming
  • Model of computation
  • Synchronous data flow model
  • Program is a graph of independent actors and
    streams
  • Actors have an execution step with known
    input/output rates
  • Compiler schedules and manages buffers
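
Because every actor's input/output rates are known, a steady-state schedule can be derived by balancing tokens on each edge. The following is a minimal sketch for a linear pipeline only; the function name `pipeline_repetitions` and the example rates (A pushes 1, B pops 5 and pushes 1, C pops 1) are assumptions for illustration, not taken from the StreamIt compiler.

```python
from fractions import Fraction
from math import lcm


def pipeline_repetitions(rates):
    """Return the smallest integer firing counts that balance a linear pipeline.

    rates[i] is a (pop, push) pair: tokens consumed and produced per firing
    of actor i. The pop rate of the first actor and the push rate of the
    last actor are ignored.
    """
    reps = [Fraction(1)]
    for i in range(1, len(rates)):
        produced = rates[i - 1][1]  # tokens pushed per firing of the upstream actor
        consumed = rates[i][0]      # tokens popped per firing of this actor
        # Balance equation on the edge: reps[i-1] * produced == reps[i] * consumed
        reps.append(reps[-1] * produced / consumed)
    # Scale the rational solution to the smallest all-integer solution.
    scale = lcm(*(r.denominator for r in reps))
    return [int(r * scale) for r in reps]


# Assumed example: A pushes 1; B pops 5 and pushes 1; C pops 1.
print(pipeline_repetitions([(0, 1), (5, 1), (1, 0)]))  # [5, 1, 1]
```

With these counts, one steady-state iteration leaves every buffer at its starting fill level, which is what lets a compiler size buffers statically.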

(Figure: synchronous data flow graph with actors A, B, and C; edge rates 1, 5, 1, 1; firing counts x5, x1, x1)
7
StreamIt Language Contd.
  • Each construct has a single input/output stream
  • Hierarchical structure
  • Filters can be stateful/stateless

(Figure: StreamIt constructs: filter; pipeline; splitjoin, with parallel computation between a splitter and a joiner; feedback loop, with a joiner and a splitter. A nested stream may be any StreamIt language construct.)
8
Research Question: How to Orchestrate a Stream Graph?
  • Mapping actors
  • Eliminate bottlenecks (a.k.a. hot actors)

9
What Do We Focus On?
  • Algorithm for actor allocation

(Figure: actors A (5), B (60), C (60), and D (5) allocated to Core 1, Core 2, and Core 3)
Makespan 60, speedup 130/60 = 2.17
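
The makespan and speedup figures follow from simple arithmetic over the actor work values. The sketch below reproduces them; the concrete mapping (B alone, C alone, A and D together) is an assumption, since the transcript does not record which actor sits on which core.

```python
def makespan(work, mapping, cores):
    """Return the load of the most heavily loaded core."""
    load = [0] * cores
    for actor, core in mapping.items():
        load[core] += work[actor]
    return max(load)


work = {"A": 5, "B": 60, "C": 60, "D": 5}    # work per actor, from the slide
mapping = {"A": 2, "B": 0, "C": 1, "D": 2}   # assumed core index per actor

span = makespan(work, mapping, cores=3)
speedup = sum(work.values()) / span          # single-core time / parallel time
print(span, round(speedup, 2))               # 60 2.17
```

No mapping of the unsplit actors can beat a makespan of 60, because B and C each need 60 on their own; that is why bottleneck elimination is treated as a separate step.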
10
What Do We Focus On?
  • Bottleneck elimination

Makespan 47, speedup 130/47 = 2.77
11
Orchestration of Stream Program Contd.
  • Current state of the art
  • Integer Linear Programming: intractable
  • Heuristics: unknown performance
  • How to find a fast and good solution?
  • Approximation algorithms that have polynomial
    runtime and a quality bound for the solution

12
Our Contribution
  • A simple quantitative analysis to detect and
    eliminate bottlenecks
  • A novel 2-approximation algorithm for deploying
    stream graphs on multicore platforms
  • Experiments and comparison
  • Results are within 5% of the optimal solution
  • Achieves a geometric mean speedup of 6.95x for 8
    processors over single processor execution

13
Focus of Our Work
StreamIt Compiler Phases
Stream Graph Scheduling
Linear Functional Equation Solver
Stream Graph Partitioning
Bottleneck Resolver
Layout on Target Architecture
Actor Allocation on Processors
Communication Scheduling
14
Our Data Transfer Model
  • Arrival rate (to be maximized) depends on the data
    rates of the actors
  • Data transfer model forms a system of simultaneous
    linear functional equations
  • Compute a closed form of the output data rate
  • We also consider a processor utilization function
    for each actor

15
Bottleneck Analysis
  • The arrival rate is limited by
  • Processor capacity of the cores
  • Memory bandwidth
  • A quantitative analysis determines
  • An upper bound of the arrival rate imposed by an
    actor
  • An upper bound of the arrival rate imposed by the
    parallel system
  • Hot actor: upper bound (actor) < upper bound (system)
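
The hot-actor test can be sketched under a simplifying assumption that is not stated in the slides: each actor's processor utilization grows linearly with the arrival rate, and only processor capacity (not memory bandwidth) is modeled. The name `hot_actors` and the parameter `util_per_rate` are hypothetical.

```python
def hot_actors(util_per_rate, cores):
    """Return actors whose own rate bound is below the system-wide bound.

    util_per_rate[a] is the (assumed linear) fraction of one core that
    actor a needs per unit of arrival rate.
    """
    # The whole system saturates when total utilization reaches `cores`.
    system_bound = cores / sum(util_per_rate.values())
    # A single actor saturates its core when its utilization reaches 1.
    return sorted(a for a, u in util_per_rate.items()
                  if 1.0 / u < system_bound)


# Work values from the earlier allocation example, read as utilization per unit rate.
util = {"A": 5, "B": 60, "C": 60, "D": 5}
print(hot_actors(util, cores=3))  # ['B', 'C']
print(hot_actors(util, cores=2))  # []
```

On three cores, B and C each cap the rate at 1/60 while the system as a whole could sustain 3/130, so they are the bottlenecks; on two cores the system bound 2/130 is the tighter one and no actor is hot.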

16
Resolving Bottlenecks
17
Actor Allocation Problem
  • The actor allocation problem (AAP) is NP-hard
  • For a fixed arrival rate, the AAP reduces to
    standard bin-packing problem (closed form)
  • There exist approximation algorithms for
    bin-packing
  • Polynomial running time
  • Solution quality is bounded
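
As an illustration of a polynomial-time bin-packing heuristic with a proven worst-case bound, the sketch below uses first-fit decreasing. This is a stand-in chosen for familiarity; the slides do not say which bin-packing algorithm the authors use. Utilizations are given as integer percentages of one core at an assumed fixed arrival rate.

```python
def first_fit_decreasing(utilization, capacity=100):
    """Pack actors onto cores; return one list of actor names per core used."""
    bins = []  # each entry is [remaining capacity, list of actors]
    # Consider actors from the highest utilization to the lowest.
    for actor, u in sorted(utilization.items(), key=lambda kv: kv[1], reverse=True):
        if u > capacity:
            raise ValueError(f"actor {actor} does not fit on any core")
        for b in bins:
            if u <= b[0]:          # first core with enough room
                b[0] -= u
                b[1].append(actor)
                break
        else:                      # no existing core fits: open a new one
            bins.append([capacity - u, [actor]])
    return [actors for _, actors in bins]


# Assumed utilizations in percent of one core at a fixed arrival rate.
print(first_fit_decreasing({"A": 5, "B": 60, "C": 60, "D": 5}))
# [['B', 'A', 'D'], ['C']]
```

If the number of cores returned does not exceed the number available, the fixed arrival rate is feasible for this heuristic.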

18
Actor Allocation Constraint
(Figure: actors with their utilizations packed into cores)
Each core has 100% utilization
19
Binary Search
(Figure, animated over slides 19 to 21: binary search over the arrival rate z; axis marks 0, ub(z), and 1.0)
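
The binary search can be sketched as follows: pick a candidate arrival rate between 0 and the upper bound, check whether the actors fit on the available cores at that rate, and halve the interval. The feasibility check here is a first-fit decreasing packing, and utilization is assumed to scale linearly with the rate; both are assumptions for illustration, and the names `fits` and `max_arrival_rate` are hypothetical.

```python
def fits(util_per_rate, rate, cores):
    """First-fit decreasing check: do all actors fit on `cores` cores at `rate`?"""
    remaining = []  # spare capacity per opened core; each core offers 1.0
    for u in sorted((w * rate for w in util_per_rate), reverse=True):
        if u > 1.0:
            return False           # one actor alone would overload a core
        for i, r in enumerate(remaining):
            if u <= r:
                remaining[i] = r - u
                break
        else:
            remaining.append(1.0 - u)
    return len(remaining) <= cores


def max_arrival_rate(util_per_rate, cores, upper_bound, steps=40):
    """Binary search for the largest feasible rate in [0, upper_bound]."""
    lo, hi = 0.0, upper_bound
    for _ in range(steps):
        mid = (lo + hi) / 2
        if fits(util_per_rate, mid, cores):
            lo = mid               # feasible: try a higher rate
        else:
            hi = mid               # infeasible: back off
    return lo


# Assumed utilization per unit rate for four actors on three cores.
util = [0.05, 0.60, 0.60, 0.05]
rate = max_arrival_rate(util, cores=3, upper_bound=3 / sum(util))
print(round(rate, 3))  # 1.667
```

In this example the search stops at about 1/0.6, the point where each of the two heavy actors fills a core on its own. The sketch assumes that feasibility is monotone in the rate, which holds for this input but is not guaranteed for a heuristic packer in general.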
22
Actor Allocation of Bottleneck-Free Program
(Figure: mapping after efficient bottleneck resolving)
Makespan 45, speedup 130/45 = 2.89
23
Experiments
  • Our method implemented as an extension of
    StreamIt compiler
  • We compare against the ILP-based method of Scott 08
    (solved with CPLEX)
  • Hardware Setup
  • 2.33GHz dual quad-core Intel Xeon processors
  • 16GB memory
  • Linux kernel version 2.6.23
  • Profiler uses the x86-64 hardware cycle
    counters

24
Experiments Contd.
  • Experimental Process
  • Profiling
  • Computing closed form
  • Resolve bottlenecks
  • Compute the mapping
  • Compute the layout scheduling
  • Invoke the StreamIt back end
  • Finally we measure the performance

25
Experiments Contd.
Benchmark          Actors  Stateful
DCT                    22        18
FMRadio                67        23
TDE                    55        27
FFT                    26        14
MergeSort              31         2
FilterBankNew          53        34
RadixSort              13         2
EqualizerProgram       65        22
BitonicSort           452         2
DES                   375       180
MPEG                   39         7
MatrixMult             52         2
26
Experimental Results for 2 to 4 Processors
Benchmark     Proc  ILP time, optimal (s)
DCT              2                   0.27
DCT              3                1585.69
DCT              4                2285.01
FMRadio          2                   0.08
FMRadio          3                   3.22
FMRadio          4                   1.29
TDE              2                   0.09
TDE              3                   0.17
TDE              4                 274.69
FFT              2                   0.46
FFT              3               44694.25
FFT              4              249240.09
Equalizer        2                   0.06
Equalizer        3                   5.29
Equalizer        4               57553.83
BitonicSort      2                   0.3
BitonicSort      3                   3.06
BitonicSort      4               16371.99
DES              2                   0.51
DES              3                   2.73
DES              4                  11.24
MPEG             2                   0.12
MPEG             3                   1.37
MPEG             4                   0.44
Our method's run time: < 1 s
27
Experimental Results for 2 to 4 Processors
Benchmark     Proc  Arrival rate ratio (approx./optimal)  Approx. arrival rate (MBps)
Equalizer        2                                  1.00                         0.56
Equalizer        3                                  1.00                         0.83
Equalizer        4                                  1.00                         1.59
BitonicSort      2                                  1.00                         3.16
BitonicSort      3                                  1.00                         4.73
BitonicSort      4                                  1.00                        10.14
DES              2                                  1.00                         0.12
DES              3                                  1.00                         0.18
DES              4                                  1.00                         0.24
MPEG             2                                  1.00                        36.59
MPEG             3                                  0.99                        54.68
MPEG             4                                  1.00                        73.22
DCT              2                                  0.99                        42.56
DCT              3                                  0.97                        62.46
DCT              4                                  0.96                        82.57
FMRadio          2                                  1.00                         2.69
FMRadio          3                                  1.00                         4.00
FMRadio          4                                  0.99                         5.34
TDE              2                                  1.00                        13.31
TDE              3                                  1.00                        19.80
TDE              4                                  1.00                        28.81
FFT              2                                  0.98                        43.91
FFT              3                                  0.98                        46.85
FFT              4                                  0.95                        95.50
28
Speedup Results
29
Summary
  • Approximation algorithm for solving actor
    allocation problem
  • Data rate transfer model that resolves
    bottlenecks
  • We separate the bottleneck elimination from the
    actor allocation
  • We implemented our approach and compared with an
    optimal approach
  • Optimal approach has unpredictable time
  • Our approach has negligible time for all
    benchmarks
  • Quality of our approach is at most 5% off the
    optimum
  • For up to 8 processors we achieve a geometric
    mean speedup of 6.95x over single processor
    execution

30
Related Works
  • 1. Static Scheduling of SDF Programs for DSP (Lee 87)
  • 2. StreamIt: A Language for Streaming Applications (Thies 02)
  • 3. Phased Scheduling of Stream Programs (Thies 03)
  • 4. Exploiting Coarse-Grained Task, Data, and Pipeline
    Parallelism in Stream Programs (Thies 06)
  • 5. Orchestrating the Execution of Stream Programs on
    Cell (Scott 08)
  • 6. Software Pipelined Execution of Stream Programs on
    GPUs (Udupa 09)
  • 7. Synergistic Execution of Stream Programs on
    Multicores with Accelerators (Udupa 09)

31
Questions?