Title: S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures
1. S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures
- Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert Bond, MIT Lincoln Laboratory
- 27 September 2001
- HPEC Workshop, Lexington, MA
- This work is sponsored by the United States Air Force under Contract F19628-00-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the author and are not necessarily endorsed by the Department of Defense.
2. Outline
- Introduction
- Design
- Demonstration
- Results
- Summary
3. PCA Needs System-Level Optimization
Signal Processing Application (made up of PCA components):
- Beamform: XOUT = w * XIN
- Filter: XOUT = FIR(XIN)
- Detect: XOUT = XIN > c
- Applications are built with components
- Components have a defined scope
  - Capable of local optimization
- The system requires global optimization
  - Not visible to components
  - Too complex to add to the application
- Need system-level optimization capabilities as part of PCA
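The three component equations above can be sketched as plain functions. This is an illustrative sketch only (the weight vector `w`, filter `taps`, and threshold `c` are made-up parameters, not part of any PCA API):

```python
# Minimal sketch of the three PCA-style components; names are illustrative.

def beamform(x_in, w):
    """XOUT = w * XIN: weight each input channel."""
    return [wi * xi for wi, xi in zip(w, x_in)]

def fir_filter(x_in, taps):
    """XOUT = FIR(XIN): direct-form FIR convolution."""
    out = []
    for n in range(len(x_in)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * x_in[n - k]
        out.append(acc)
    return out

def detect(x_in, c):
    """XOUT = XIN > c: threshold detection."""
    return [x > c for x in x_in]

def pipeline(x_in, w, taps, c):
    """Chain the components, as an application would."""
    return detect(fir_filter(beamform(x_in, w), taps), c)
```

Each function has a defined, local scope; nothing in this chain can see how the whole pipeline should share hardware, which is the point of the slide.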
4. Example: Optimum System Latency
- Simple two component system
- Local optimum fails to satisfy global constraints
- Need system view to find global optimum
5. System Optimization Challenge
Signal Processing Application:
- Beamform: XOUT = w * XIN
- Filter: XOUT = FIR(XIN)
- Detect: XOUT = XIN > c
Optimal Resource Allocation (Latency, Throughput, Memory, Bandwidth, ...)
Compute Fabric (Cluster, FPGA, SOC, ...)
- Optimizing to system constraints requires a two-way component/system knowledge exchange
- Need a framework to mediate the exchange and perform system-level optimization
6. S3P: Lincoln Internal R&D Program
- Goal: applications that self-optimize to any hardware
- Combine LL system expertise and the LCS FFTW approach
[Figure: Parallel Signal Processing (Kepner/Hoffmann, Lincoln) meets Self-Optimizing Software (Leiserson/Frigo, MIT LCS). Algorithm stages 1..N are mapped onto processor mappings 1..M; the S3P Framework times and verifies candidate mappings to find the best mappings.]
S3P brings the self-optimizing (FFTW) approach to parallel signal processing systems
- The framework exploits a graph theory abstraction
- Broadly applicable to system optimization problems
- Defines clear component and system requirements
7. Outline
- Introduction
- Design
- Demonstration
- Results
- Summary
8. System Requirements
- Decomposable into Tasks (computation) and Conduits (communication)
  - Beamform: XOUT = w * XIN
  - Filter: XOUT = FIR(XIN)
  - Detect: XOUT = XIN > c
- Mappable to different sets of hardware
- Measurable resource usage of each mapping
- Each compute stage can be mapped to different sets of hardware and timed
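The three requirements (decomposable, mappable, measurable) can be captured as a small interface sketch. `Task`, `map_to`, and `measure` here are illustrative stand-ins, not PVL or S3P classes:

```python
# Illustrative sketch of the system requirements: a task that can be mapped
# to a set of hardware and have its resource usage measured.
import time

class Task:
    def __init__(self, name, work):
        self.name = name
        self.work = work          # the computation this task performs
        self.cpus = None          # hardware assignment, set by map_to()

    def map_to(self, cpus):
        """Map this task to a set of hardware (here, just a CPU count)."""
        self.cpus = cpus

    def measure(self, data):
        """Run the task once and return (result, elapsed seconds)."""
        start = time.perf_counter()
        result = self.work(data)
        return result, time.perf_counter() - start

# A trivial task: scale every sample.
scale = Task("scale", lambda xs: [2 * x for x in xs])
scale.map_to(cpus=4)
result, elapsed = scale.measure([1, 2, 3])
```

A conduit would expose the same map/measure interface for the communication between a pair of task mappings.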
9. System Graph
Tasks: Beamform -> Filter -> Detect
- Node: a unique mapping of a task
- Edge: a conduit between a pair of task mappings
- The System Graph can store the hardware resource usage of every possible Task and Conduit
10. Path = System Mapping
Tasks: Beamform -> Filter -> Detect
- Each path is a complete system mapping
- The best path is the optimal system mapping
- The graph construct is very general and widely used for optimization problems
- Many efficient techniques exist for choosing the best path (under constraints), such as Dynamic Programming
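The graph construction on these two slides can be sketched directly: one node per (task, mapping) pair, one edge per conduit between mappings of adjacent tasks. The mapping names below are invented for illustration:

```python
# Build a layered system graph: nodes are (task, mapping) pairs, edges are
# conduits between mappings of adjacent tasks. Mapping names are illustrative.
tasks = ["Beamform", "Filter", "Detect"]
mappings = {t: ["1cpu", "2cpu"] for t in tasks}    # candidate mappings per task

nodes = [(t, m) for t in tasks for m in mappings[t]]
edges = []
for a, b in zip(tasks, tasks[1:]):                  # adjacent task pairs
    for ma in mappings[a]:
        for mb in mappings[b]:
            edges.append(((a, ma), (b, mb)))
```

Any source-to-sink path through this layered graph picks exactly one mapping per task, which is why a path is a complete system mapping.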
11. Example: Maximize Throughput
Tasks: Beamform -> Filter -> Detect
[Figure: system graph with example task times on nodes (3.0, 4.0, 6.0, 6.0, 8.0, 16.0) and conduit times on edges (1.5, 2.0, 3.0); task times shrink as more hardware is applied.]
- Each node stores the task time for a given mapping
- Each edge stores the conduit time for a given pair of mappings
- Goal: maximize throughput and minimize hardware
- Choose the path with the smallest bottleneck that satisfies the hardware constraint
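The selection rule on this slide (smallest bottleneck that fits a hardware budget) can be sketched by enumerating paths. The times and CPU counts below are invented for illustration and do not reproduce the slide's figure:

```python
# Pick the system mapping whose slowest stage or conduit (the bottleneck) is
# smallest, subject to a total-CPU budget. All numbers are illustrative.
from itertools import product

# (task time, cpus used) for each candidate mapping of each of 3 tasks
task_opts = [[(6.0, 1), (3.0, 2)], [(8.0, 1), (4.0, 2)], [(2.0, 1)]]
conduit_time = 1.5          # assume one fixed conduit time for simplicity

def best_mapping(task_opts, cpu_budget):
    best = None
    for combo in product(*task_opts):
        cpus = sum(c for _, c in combo)
        if cpus > cpu_budget:
            continue                                   # violates the constraint
        bottleneck = max(max(t for t, _ in combo), conduit_time)
        if best is None or bottleneck < best[0]:
            best = (bottleneck, combo)
    return best
```

With a budget of 5 CPUs the bottleneck drops to 4.0 (Filter on 2 CPUs); with only 4 CPUs the best achievable bottleneck is 6.0. Throughput is the reciprocal of the bottleneck time, so minimizing the bottleneck maximizes throughput.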
12. Path Finding Algorithms
- The graph construct is very general
- Widely used for optimization problems
- Many efficient techniques for choosing the best path (under constraints), such as Dijkstra's Algorithm and Dynamic Programming

Dynamic Programming:

    N = total hardware units; M = number of tasks
    P[i] = number of mappings for task i; t = M
    pathTable[M][N] = all infinite-weight paths
    for j = 1..M
      for k = 1..P[j]
        for i = j..N-t+1
          if i - size[k] > j
            if j > 1
              w = weight(pathTable[j-1][i-size[k]]) + weight(k)
                  + weight(edge(last(pathTable[j-1][i-size[k]]), k))
              p = addVertex(pathTable[j-1][i-size[k]], k)
            else
              w = weight(k)
              p = makePath(k)
            if weight(pathTable[j][i]) > w
              pathTable[j][i] = p
      t = t - 1

Dijkstra's Algorithm:

    initialize graph G and source vertex s
    store all vertices of G in a minimum priority queue Q
    while Q is not empty
      u = pop(Q)
      for each vertex v adjacent to u
        w = u.totalPathWeight() + weight of edge <u,v> + v.weight()
        if v.totalPathWeight() > w
          v.totalPathWeight() = w
          v.predecessor() = u
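A runnable version of the Dijkstra variant sketched above (vertex weights for task times plus edge weights for conduit times accumulated along the path) might look like this; the dictionary graph encoding is an assumption, not the S3P implementation:

```python
# Dijkstra over a graph whose vertices carry weights (task times) and whose
# edges carry weights (conduit times); path cost sums both, as in the
# pseudocode above. Graph encoding is illustrative.
import heapq

def shortest_path(node_w, edge_w, source, target):
    """node_w: vertex -> weight; edge_w: (u, v) -> weight."""
    dist = {v: float("inf") for v in node_w}
    pred = {}
    dist[source] = node_w[source]
    pq = [(dist[source], source)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist[u]:
            continue                      # stale priority-queue entry
        for (a, b), w in edge_w.items():
            if a != u:
                continue
            nd = d + w + node_w[b]        # path weight + edge + vertex weight
            if nd < dist[b]:
                dist[b] = nd
                pred[b] = a
                heapq.heappush(pq, (nd, b))
    # Reconstruct the best path by following predecessors.
    path, v = [target], target
    while v != source:
        v = pred[v]
        path.append(v)
    return dist[target], path[::-1]
```

For a latency objective the path weight is the sum along the path, which is exactly what this shortest-path formulation minimizes; the bottleneck (throughput) objective on the previous slide needs a maximin variant instead.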
13. S3P Inputs and Outputs
Inputs: Application, Algorithm Information, System Constraints, Hardware Information
S3P Framework -> Best System Mapping
- Can flexibly add information about:
  - Application
  - Algorithm
  - System
  - Hardware
14. Outline
- Introduction
- Design
- Demonstration
- Results
- Summary
15. S3P Demonstration Testbed
Multi-Stage Application
Hardware (Workstation Cluster)
16. Multi-Stage Application
- Features
  - Generic radar/sonar signal processing chain
  - Utilizes key kernels (FIR, matrix multiply, FFT, and corner turn)
  - Scalable to any problem size (fully parameterized algorithm)
  - Self-validates (built-in target generator)
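Of the kernels listed, the corner turn is the least standard: it is a distributed matrix transpose that reorganizes data so the next stage can process along the other dimension. A minimal single-process sketch (a real corner turn is an all-to-all redistribution across processors):

```python
# Minimal sketch of a corner turn: transpose a channels x samples matrix so
# the next stage can process along the other dimension. In a parallel system
# this is an all-to-all redistribution; here it is a plain transpose.

def corner_turn(matrix):
    return [list(row) for row in zip(*matrix)]

# Two channels of four samples become four samples of two channels.
turned = corner_turn([[1, 2, 3, 4],
                      [5, 6, 7, 8]])
```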
17. Parallel Vector Library (PVL)
- Simple mappable components support data, task and
pipeline parallelism
18. Hardware Platform
- Network of 8 Linux workstations
  - Dual 800 MHz Pentium III processors
- Communication
  - Gigabit Ethernet, 8-port switch
  - Isolated network
- Software
  - Linux kernel release 2.2.14
  - GNU C Compiler
  - MPICH communication library over TCP/IP
- Advantages
  - Software tools
  - Widely available
  - Inexpensive (high Mflops/$)
  - Excellent rapid prototyping platform
- Disadvantages
  - Non-real-time OS
  - Non-real-time messaging
  - Slower interconnect
  - Difficult to model
  - SMP behavior erratic
19. S3P Engine
Inputs: Application Program, Algorithm Information, System Constraints, Hardware Information
Output: Best System Mapping
S3P Engine components: Map Generator, Map Timer, Map Selector
- The Map Generator constructs the system graph for all candidate mappings
- The Map Timer times each node and edge of the system graph
- The Map Selector searches the system graph for the optimal set of maps
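The generate/time/select division of labor can be sketched as a single loop. The function names and the linear-speedup cost model are invented for illustration; a real Map Timer measures actual runs rather than evaluating a model:

```python
# Illustrative skeleton of the S3P engine's generate -> time -> select flow.
from itertools import product

def generate_maps(tasks, max_cpus):
    """Map Generator: all candidate task -> CPU-count assignments."""
    for combo in product(range(1, max_cpus + 1), repeat=len(tasks)):
        if sum(combo) <= max_cpus:
            yield dict(zip(tasks, combo))

def time_map(mapping, cost_model):
    """Map Timer: a cost model stands in for real measurements here."""
    return max(cost_model(task, cpus) for task, cpus in mapping.items())

def select_map(tasks, max_cpus, cost_model):
    """Map Selector: pick the mapping with the smallest bottleneck time."""
    return min(generate_maps(tasks, max_cpus),
               key=lambda m: time_map(m, cost_model))

# Toy cost model: each task's time shrinks linearly with CPU count.
work = {"Beamform": 8.0, "Filter": 4.0, "Detect": 2.0}
best = select_map(list(work), max_cpus=6,
                  cost_model=lambda t, c: work[t] / c)
```

With 6 CPUs the selector gives the heaviest stage the most processors, balancing the pipeline; exhaustive enumeration like this is what the search-space-reduction slide later in the deck aims to avoid.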
20. Outline
- Introduction
- Design
- Demonstration
- Results
- Summary
21. Optimal Throughput
- Vary the number of processors used on each stage
- Time each computation stage and communication conduit
- Find the path with the minimum bottleneck
[Figure: candidate mappings using 1, 2, 3, and 4 CPUs]
22. S3P Timings (4 CPU max)
- Graphical depiction of timings (wider is better)
[Figure: task timings for mappings on 4, 3, 2, and 1 CPUs]
23. S3P Timings (12 CPU max)
- Graphical depiction of timings (wider is better)
[Figure: timings for the Input, Low Pass Filter, Beamform, and Matched Filter tasks on 2, 4, 6, 8, and 12 CPUs]
- The large amount of timing data requires an algorithm to find the best path
24. Predicted and Achieved Latency (4-8 CPU max)
[Figure: latency (sec) vs. maximum number of processors for the large (48x128K) and small (48x4K) problem sizes]
- Find the path that produces minimum latency for a given number of processors
- Excellent agreement between S3P-predicted and achieved latencies
25. Predicted and Achieved Throughput (4-8 CPU max)
[Figure: throughput (pulses/sec) vs. maximum number of processors for the large (48x128K) and small (48x4K) problem sizes]
- Find the path that produces maximum throughput for a given number of processors
- Excellent agreement between S3P-predicted and achieved throughput
26. SMP Results (16 CPU max)
[Figure: throughput (pulses/sec) vs. maximum number of processors for the large (48x128K) problem size]
- SMP overstresses Linux real-time capabilities
  - Poor overall system performance
  - Divergence between predicted and measured throughput
27. Simulated (128 CPU max)
[Figure: throughput (pulses/sec) and latency (sec) vs. maximum number of processors for the small (48x4K) problem size]
- The simulator allows exploration of larger systems
28. Reducing the Search Space: Algorithm Comparison
[Figure: number of timings required vs. maximum number of processors]
- Graph algorithms provide baseline performance
- Hill-climbing performance varies as a function of initialization and neighborhood definition
- The preprocessor outperforms all other algorithms
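The hill-climbing alternative compared above can be sketched as a local search over CPU assignments. The neighborhood (move one CPU between stages), the toy cost model, and the timing count are all assumptions for illustration:

```python
# Illustrative hill climbing over CPU assignments: repeatedly move one CPU
# between stages while the bottleneck time improves. Cost model is a toy.

def bottleneck(assign, work):
    return max(w / c for w, c in zip(work, assign))

def neighbors(assign):
    """All assignments reachable by moving one CPU between two stages."""
    for i in range(len(assign)):
        for j in range(len(assign)):
            if i != j and assign[i] > 1:
                n = list(assign)
                n[i] -= 1
                n[j] += 1
                yield tuple(n)

def hill_climb(start, work):
    current, timings = tuple(start), 1     # count each mapping timed
    while True:
        cands = list(neighbors(current))
        timings += len(cands)
        best = min(cands, key=lambda a: bottleneck(a, work))
        if bottleneck(best, work) >= bottleneck(current, work):
            return current, timings        # local optimum reached
        current = best

work = [8.0, 4.0, 2.0]
best, timings = hill_climb([2, 2, 2], work)
```

Only the mappings actually visited are timed, which is why hill climbing needs far fewer timings than exhaustive graph search, and why its result depends on the starting point and neighborhood definition.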
29. Future Work
- Program area
  - Determine how to incorporate global optimization into other middleware efforts (e.g. PCA, HPEC-SI, ...)
- Hardware area
  - Scale and demonstrate on a larger/real-time system
    - HPCMO Mercury system at WPAFB
    - Expect even better results than on the Linux cluster
  - Apply to parallel hardware
    - RAW
- Algorithm area
  - Exploit ways of reducing the search space
  - Provide solution families via sensitivity analysis
30. Outline
- Introduction
- Design
- Demonstration
- Results
- Summary
31. Summary
- System-level constraints (latency, throughput, hardware size, ...) necessitate system-level optimization
- The application requirements for system-level optimization are:
  - Decomposable into components (input, filtering, output, ...)
  - Mappable to different configurations (processors, links, ...)
  - Measurable resource usage (time, memory, ...)
- S3P demonstrates that global optimization is feasible separately from the application
32. Acknowledgements
- Matteo Frigo (MIT/LCS, Vanu Inc.)
- Charles Leiserson (MIT/LCS)
- Adam Wierman (CMU)