Title: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures
1. S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures
- Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond, MIT Lincoln Laboratory
- 27 September 2001
- HPEC Workshop, Lexington, MA
- This work is sponsored by the United States Air Force under Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.
2. Outline
- Introduction
- Design
- Demonstration
- Results
- Summary
3. PCA Needs System-Level Optimization
Signal Processing Application (made up of PCA components):
- Beamform: XOUT = w * XIN
- Filter: XOUT = FIR(XIN)
- Detect: XOUT = XIN > c
- Applications are built with components
- Components have a defined scope
  - Capable of local optimization
- The system requires global optimization
  - Not visible to components
  - Too complex to add to the application
- Need system-level optimization capabilities as part of PCA
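The three component equations above can be sketched as plain functions. This is an illustrative sketch only (the weight vector `w`, filter `taps`, and threshold `c` are made-up parameters, not part of any PCA API):

```python
# Minimal sketch of the three PCA-style components; names are illustrative.

def beamform(x_in, w):
    """XOUT = w * XIN: weight each input channel."""
    return [wi * xi for wi, xi in zip(w, x_in)]

def fir_filter(x_in, taps):
    """XOUT = FIR(XIN): direct-form FIR convolution."""
    out = []
    for n in range(len(x_in)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * x_in[n - k]
        out.append(acc)
    return out

def detect(x_in, c):
    """XOUT = XIN > c: threshold detection."""
    return [x > c for x in x_in]

def pipeline(x_in, w, taps, c):
    """Chain the components, as an application would."""
    return detect(fir_filter(beamform(x_in, w), taps), c)
```

Each function has a defined, local scope; nothing in this chain can see how the whole pipeline should share hardware, which is the point of the slide.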
4. Example: Optimum System Latency
- Simple two component system
- Local optimum fails to satisfy global constraints
- Need system view to find global optimum
5. System Optimization Challenge
Signal Processing Application:
- Beamform: XOUT = w * XIN
- Filter: XOUT = FIR(XIN)
- Detect: XOUT = XIN > c
Optimal Resource Allocation (Latency, Throughput, Memory, Bandwidth, ...)
Compute Fabric (Cluster, FPGA, SOC, ...)
- Optimizing to system constraints requires a two-way component/system knowledge exchange
- Need a framework to mediate the exchange and perform system-level optimization
6. S3P: Lincoln Internal R&D Program
- Goal: applications that self-optimize to any hardware
- Combine LL system expertise and the LCS FFTW approach
[Figure: Parallel Signal Processing (Kepner/Hoffmann, Lincoln) meets Self-Optimizing Software (Leiserson/Frigo, MIT LCS). Algorithm stages 1..N are mapped onto processor mappings 1..M; the S3P Framework times and verifies candidate mappings to find the best mappings.]
S3P brings the self-optimizing (FFTW) approach to parallel signal processing systems
- The framework exploits a graph theory abstraction
- Broadly applicable to system optimization problems
- Defines clear component and system requirements
7. Outline
- Introduction
- Design
- Demonstration
- Results
- Summary
8. System Requirements
- Decomposable into Tasks (computation) and Conduits (communication)
  - Beamform: XOUT = w * XIN
  - Filter: XOUT = FIR(XIN)
  - Detect: XOUT = XIN > c
- Mappable to different sets of hardware
- Measurable resource usage of each mapping
- Each compute stage can be mapped to different sets of hardware and timed
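The three requirements (decomposable, mappable, measurable) can be captured as a small interface sketch. `Task`, `map_to`, and `measure` here are illustrative stand-ins, not PVL or S3P classes:

```python
# Illustrative sketch of the system requirements: a task that can be mapped
# to a set of hardware and have its resource usage measured.
import time

class Task:
    def __init__(self, name, work):
        self.name = name
        self.work = work          # the computation this task performs
        self.cpus = None          # hardware assignment, set by map_to()

    def map_to(self, cpus):
        """Map this task to a set of hardware (here, just a CPU count)."""
        self.cpus = cpus

    def measure(self, data):
        """Run the task once and return (result, elapsed seconds)."""
        start = time.perf_counter()
        result = self.work(data)
        return result, time.perf_counter() - start

# A trivial task: scale every sample.
scale = Task("scale", lambda xs: [2 * x for x in xs])
scale.map_to(cpus=4)
result, elapsed = scale.measure([1, 2, 3])
```

A conduit would expose the same map/measure interface for the communication between a pair of task mappings.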
9. System Graph
Tasks: Beamform -> Filter -> Detect
- Node: a unique mapping of a task
- Edge: a conduit between a pair of task mappings
- The System Graph can store the hardware resource usage of every possible Task and Conduit
10. Path = System Mapping
Tasks: Beamform -> Filter -> Detect
- Each path is a complete system mapping
- The best path is the optimal system mapping
- The graph construct is very general and widely used for optimization problems
- Many efficient techniques exist for choosing the best path (under constraints), such as Dynamic Programming
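The graph construction on these two slides can be sketched directly: one node per (task, mapping) pair, one edge per conduit between mappings of adjacent tasks. The mapping names below are invented for illustration:

```python
# Build a layered system graph: nodes are (task, mapping) pairs, edges are
# conduits between mappings of adjacent tasks. Mapping names are illustrative.
tasks = ["Beamform", "Filter", "Detect"]
mappings = {t: ["1cpu", "2cpu"] for t in tasks}    # candidate mappings per task

nodes = [(t, m) for t in tasks for m in mappings[t]]
edges = []
for a, b in zip(tasks, tasks[1:]):                  # adjacent task pairs
    for ma in mappings[a]:
        for mb in mappings[b]:
            edges.append(((a, ma), (b, mb)))
```

Any source-to-sink path through this layered graph picks exactly one mapping per task, which is why a path is a complete system mapping.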
11. Example: Maximize Throughput
Tasks: Beamform -> Filter -> Detect
[Figure: system graph with example task times on nodes (3.0, 4.0, 6.0, 6.0, 8.0, 16.0) and conduit times on edges (1.5, 2.0, 3.0); task times shrink as more hardware is applied.]
- Each node stores the task time for a given mapping
- Each edge stores the conduit time for a given pair of mappings
- Goal: maximize throughput and minimize hardware
- Choose the path with the smallest bottleneck that satisfies the hardware constraint
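The selection rule on this slide (smallest bottleneck that fits a hardware budget) can be sketched by enumerating paths. The times and CPU counts below are invented for illustration and do not reproduce the slide's figure:

```python
# Pick the system mapping whose slowest stage or conduit (the bottleneck) is
# smallest, subject to a total-CPU budget. All numbers are illustrative.
from itertools import product

# (task time, cpus used) for each candidate mapping of each of 3 tasks
task_opts = [[(6.0, 1), (3.0, 2)], [(8.0, 1), (4.0, 2)], [(2.0, 1)]]
conduit_time = 1.5          # assume one fixed conduit time for simplicity

def best_mapping(task_opts, cpu_budget):
    best = None
    for combo in product(*task_opts):
        cpus = sum(c for _, c in combo)
        if cpus > cpu_budget:
            continue                                   # violates the constraint
        bottleneck = max(max(t for t, _ in combo), conduit_time)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, combo)
    return best
```

With a budget of 5 CPUs the bottleneck drops to 4.0 (Filter on 2 CPUs); with only 4 CPUs the best achievable bottleneck is 6.0. Throughput is the reciprocal of the bottleneck time, so minimizing the bottleneck maximizes throughput.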
12. Path Finding Algorithms
- The graph construct is very general
- Widely used for optimization problems
- Many efficient techniques for choosing the best path (under constraints), such as Dijkstra's Algorithm and Dynamic Programming

Dynamic Programming:

    N = total hardware units; M = number of tasks
    P[i] = number of mappings for task i; t = M
    pathTable[M][N] = all infinite-weight paths
    for j = 1..M
      for k = 1..P[j]
        for i = j..N-t+1
          if i - size[k] > j
            if j > 1
              w = weight(pathTable[j-1][i-size[k]]) + weight(k)
                  + weight(edge(last(pathTable[j-1][i-size[k]]), k))
              p = addVertex(pathTable[j-1][i-size[k]], k)
            else
              w = weight(k)
              p = makePath(k)
            if weight(pathTable[j][i]) > w
              pathTable[j][i] = p
      t = t - 1

Dijkstra's Algorithm:

    initialize graph G and source vertex s
    store all vertices of G in a minimum priority queue Q
    while Q is not empty
      u = pop(Q)
      for each vertex v adjacent to u
        w = u.totalPathWeight() + weight of edge <u,v> + v.weight()
        if v.totalPathWeight() > w
          v.totalPathWeight() = w
          v.predecessor() = u
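A runnable version of the Dijkstra variant sketched above (vertex weights for task times plus edge weights for conduit times accumulated along the path) might look like this; the dictionary graph encoding is an assumption, not the S3P implementation:

```python
# Dijkstra over a graph whose vertices carry weights (task times) and whose
# edges carry weights (conduit times); path cost sums both, as in the
# pseudocode above. Graph encoding is illustrative.
import heapq

def shortest_path(node_w, edge_w, source, target):
    """node_w: vertex -> weight; edge_w: (u, v) -> weight."""
    dist = {v: float("inf") for v in node_w}
    pred = {}
    dist[source] = node_w[source]
    pq = [(dist[source], source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue                      # stale priority-queue entry
        for (a, b), w in edge_w.items():
            if a != u:
                continue
            nd = d + w + node_w[b]        # path weight + edge + vertex weight
            if nd < dist[b]:
                dist[b] = nd
                pred[b] = a
                heapq.heappush(pq, (nd, b))
    # Reconstruct the best path by following predecessors.
    path, v = [target], target
    while v != source:
        v = pred[v]
        path.append(v)
    return dist[target], path[::-1]
```

For a latency objective the path weight is the sum along the path, which is exactly what this shortest-path formulation minimizes; the bottleneck (throughput) objective on the previous slide needs a maximin variant instead.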
13. S3P Inputs and Outputs
Inputs: Application, Algorithm Information, System Constraints, Hardware Information
S3P Framework -> Best System Mapping
- Can flexibly add information about:
  - Application
  - Algorithm
  - System
  - Hardware
14. Outline
- Introduction
- Design
- Demonstration
- Results
- Summary
15. S3P Demonstration Testbed
Multi-Stage Application
Hardware (Workstation Cluster)
16. Multi-Stage Application
- Features
  - Generic radar/sonar signal processing chain
  - Utilizes key kernels (FIR, matrix multiply, FFT, and corner turn)
  - Scalable to any problem size (fully parameterized algorithm)
  - Self-validates (built-in target generator)
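Of the kernels listed, the corner turn is the least standard: it is a distributed matrix transpose that reorganizes data so the next stage can process along the other dimension. A minimal single-process sketch (a real corner turn is an all-to-all redistribution across processors):

```python
# Minimal sketch of a corner turn: transpose a channels x samples matrix so
# the next stage can process along the other dimension. In a parallel system
# this is an all-to-all redistribution; here it is a plain transpose.

def corner_turn(matrix):
    return [list(row) for row in zip(*matrix)]

# Two channels of four samples become four samples of two channels.
turned = corner_turn([[1, 2, 3, 4],
                      [5, 6, 7, 8]])
```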
17. Parallel Vector Library (PVL)
- Simple mappable components support data, task and
pipeline parallelism
18. Hardware Platform
- Network of 8 Linux workstations
  - Dual 800 MHz Pentium III processors
- Communication
  - Gigabit Ethernet, 8-port switch
  - Isolated network
- Software
  - Linux kernel release 2.2.14
  - GNU C Compiler
  - MPICH communication library over TCP/IP
- Advantages
  - Software tools
  - Widely available
  - Inexpensive (high Mflops/$)
  - Excellent rapid prototyping platform
- Disadvantages
  - Non-real-time OS
  - Non-real-time messaging
  - Slower interconnect
  - Difficult to model
  - SMP behavior erratic
19. S3P Engine
Inputs: Application Program, Algorithm Information, System Constraints, Hardware Information
Output: Best System Mapping
S3P Engine components: Map Generator, Map Timer, Map Selector
- The Map Generator constructs the system graph for all candidate mappings
- The Map Timer times each node and edge of the system graph
- The Map Selector searches the system graph for the optimal set of maps
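The generate/time/select division of labor can be sketched as a single loop. The function names and the linear-speedup cost model are invented for illustration; a real Map Timer measures actual runs rather than evaluating a model:

```python
# Illustrative skeleton of the S3P engine's generate -> time -> select flow.
from itertools import product

def generate_maps(tasks, max_cpus):
    """Map Generator: all candidate task -> CPU-count assignments."""
    for combo in product(range(1, max_cpus + 1), repeat=len(tasks)):
        if sum(combo) <= max_cpus:
            yield dict(zip(tasks, combo))

def time_map(mapping, cost_model):
    """Map Timer: a cost model stands in for real measurements here."""
    return max(cost_model(task, cpus) for task, cpus in mapping.items())

def select_map(tasks, max_cpus, cost_model):
    """Map Selector: pick the mapping with the smallest bottleneck time."""
    return min(generate_maps(tasks, max_cpus),
               key=lambda m: time_map(m, cost_model))

# Toy cost model: each task's time shrinks linearly with CPU count.
work = {"Beamform": 8.0, "Filter": 4.0, "Detect": 2.0}
best = select_map(list(work), max_cpus=6,
                  cost_model=lambda t, c: work[t] / c)
```

With 6 CPUs the selector gives the heaviest stage the most processors, balancing the pipeline; exhaustive enumeration like this is what the search-space-reduction slide later in the deck aims to avoid.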
20. Outline
- Introduction
- Design
- Demonstration
- Results
- Summary
21. Optimal Throughput
- Vary the number of processors used on each stage
- Time each computation stage and communication conduit
- Find the path with the minimum bottleneck
[Figure: candidate mappings using 1, 2, 3, and 4 CPUs]
22. S3P Timings (4 CPU max)
- Graphical depiction of timings (wider is better)
[Figure: task timings for mappings on 4, 3, 2, and 1 CPUs]
23. S3P Timings (12 CPU max)
- Graphical depiction of timings (wider is better)
[Figure: timings for the Input, Low Pass Filter, Beamform, and Matched Filter tasks on 2, 4, 6, 8, and 12 CPUs]
- The large amount of timing data requires an algorithm to find the best path
24. Predicted and Achieved Latency (4-8 CPU max)
[Figure: latency (sec) vs. maximum number of processors for the large (48x128K) and small (48x4K) problem sizes]
- Find the path that produces minimum latency for a given number of processors
- Excellent agreement between S3P-predicted and achieved latencies
25. Predicted and Achieved Throughput (4-8 CPU max)
[Figure: throughput (pulses/sec) vs. maximum number of processors for the large (48x128K) and small (48x4K) problem sizes]
- Find the path that produces maximum throughput for a given number of processors
- Excellent agreement between S3P-predicted and achieved throughput
26. SMP Results (16 CPU max)
[Figure: throughput (pulses/sec) vs. maximum number of processors for the large (48x128K) problem size]
- SMP overstresses Linux real-time capabilities
  - Poor overall system performance
  - Divergence between predicted and measured throughput
27. Simulated (128 CPU max)
[Figure: throughput (pulses/sec) and latency (sec) vs. maximum number of processors for the small (48x4K) problem size]
- The simulator allows exploration of larger systems
28. Reducing the Search Space: Algorithm Comparison
[Figure: number of timings required vs. maximum number of processors]
- Graph algorithms provide baseline performance
- Hill-climbing performance varies as a function of initialization and neighborhood definition
- The preprocessor outperforms all other algorithms
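The hill-climbing alternative compared above can be sketched as a local search over CPU assignments. The neighborhood (move one CPU between stages), the toy cost model, and the timing count are all assumptions for illustration:

```python
# Illustrative hill climbing over CPU assignments: repeatedly move one CPU
# between stages while the bottleneck time improves. Cost model is a toy.

def bottleneck(assign, work):
    return max(w / c for w, c in zip(work, assign))

def neighbors(assign):
    """All assignments reachable by moving one CPU between two stages."""
    for i in range(len(assign)):
        for j in range(len(assign)):
            if i != j and assign[i] > 1:
                n = list(assign)
                n[i] -= 1
                n[j] += 1
                yield tuple(n)

def hill_climb(start, work):
    current, timings = tuple(start), 1     # count each mapping timed
    while True:
        cands = list(neighbors(current))
        timings += len(cands)
        best = min(cands, key=lambda a: bottleneck(a, work))
        if bottleneck(best, work) >= bottleneck(current, work):
            return current, timings        # local optimum reached
        current = best

work = [8.0, 4.0, 2.0]
best, timings = hill_climb([2, 2, 2], work)
```

Only the mappings actually visited are timed, which is why hill climbing needs far fewer timings than exhaustive graph search, and why its result depends on the starting point and neighborhood definition.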
29. Future Work
- Program area
  - Determine how to incorporate global optimization into other middleware efforts (e.g. PCA, HPEC-SI, ...)
- Hardware area
  - Scale and demonstrate on a larger/real-time system
    - HPCMO Mercury system at WPAFB
    - Expect even better results than on the Linux cluster
  - Apply to parallel hardware
    - RAW
- Algorithm area
  - Exploit ways of reducing the search space
  - Provide solution families via sensitivity analysis
30. Outline
- Introduction
- Design
- Demonstration
- Results
- Summary
31. Summary
- System-level constraints (latency, throughput, hardware size, ...) necessitate system-level optimization
- The application requirements for system-level optimization are:
  - Decomposable into components (input, filtering, output, ...)
  - Mappable to different configurations (processors, links, ...)
  - Measurable resource usage (time, memory, ...)
- S3P demonstrates that global optimization is feasible separately from the application
32. Acknowledgements
- Matteo Frigo (MIT/LCS, Vanu Inc.)
- Charles Leiserson (MIT/LCS)
- Adam Wierman (CMU)