S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

About This Presentation
Title:

S3P: Automatic, Optimized Mapping of Signal Processing Applications to Parallel Architectures

Description:

MIT Lincoln Laboratory. 27 September 2001. HPEC Workshop, Lexington, MA ... Kepner/Hoffmann (Lincoln) Goal: applications that self-optimize to any hardware ...

1
S3P: Automatic, Optimized Mapping of Signal
Processing Applications to Parallel Architectures
  • Mr. Henry Hoffmann, Dr. Jeremy Kepner, Mr. Robert
    Bond
  • MIT Lincoln Laboratory
  • 27 September 2001
  • HPEC Workshop, Lexington, MA
  • This work is sponsored by United States Air Force
    under Contract F19628-00-C-0002. Opinions,
    interpretations, conclusions, and recommendations
    are those of the author and are not necessarily
    endorsed by the Department of Defense.

2
Outline
  • Introduction
  • Design
  • Demonstration
  • Results
  • Summary

3
PCA Need: System Level Optimization
Signal Processing Application (made up of PCA
components)
Beamform: XOUT = w * XIN
Filter: XOUT = FIR(XIN)
Detect: XOUT = |XIN| > c
[Figure: Applications and Components]
  • Applications built with components
  • Components have a defined scope
      • Capable of local optimization
  • System requires global optimization
      • Not visible to components
      • Too complex to add to application
  • Need system level optimization capabilities as
    part of PCA

4
Example: Optimum System Latency
  • Simple two component system
  • Local optimum fails to satisfy global constraints
  • Need system view to find global optimum
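
The point of this slide can be illustrated with a minimal sketch. The component names, the (CPUs, latency) options, and the CPU budget below are made-up values chosen for illustration; they are not taken from the presentation.

```python
from itertools import product

# Hypothetical (cpus, latency_seconds) options for each of two components.
OPTIONS = {
    "filter":   [(1, 8.0), (2, 4.5), (4, 3.0)],
    "beamform": [(1, 6.0), (2, 3.5), (4, 2.0)],
}
MAX_CPUS = 4  # assumed system-wide hardware constraint


def local_choice(options):
    """Each component independently picks its own fastest option."""
    return {name: min(opts, key=lambda o: o[1]) for name, opts in options.items()}


def global_choice(options, max_cpus):
    """Search all combinations for the lowest total latency within the budget."""
    names = list(options)
    best = None
    for combo in product(*(options[n] for n in names)):
        cpus = sum(c for c, _ in combo)
        latency = sum(t for _, t in combo)
        if cpus <= max_cpus and (best is None or latency < best[0]):
            best = (latency, dict(zip(names, combo)))
    return best


local = local_choice(OPTIONS)
local_cpus = sum(c for c, _ in local.values())
best_latency, best_combo = global_choice(OPTIONS, MAX_CPUS)
print("local picks need", local_cpus, "CPUs; the budget is", MAX_CPUS)
print("global optimum:", best_combo, "with latency", best_latency)
```

Here each component's locally fastest option needs 4 CPUs, so the two local optima together exceed the 4 CPU budget, while the system view finds a feasible combination.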

5
System Optimization Challenge
Signal Processing Application
Beamform: XOUT = w * XIN
Filter: XOUT = FIR(XIN)
Detect: XOUT = |XIN| > c
Optimal Resource Allocation (Latency, Throughput,
Memory, Bandwidth, ...)
Compute Fabric (Cluster, FPGA, SOC, ...)
  • Optimizing to system constraints requires two way
    component/system knowledge exchange
  • Need a framework to mediate exchange and perform
    system level optimization

6
S3P: Lincoln Internal R&D Program
  • Goal: applications that self-optimize to any
    hardware
  • Combine LL system expertise and LCS FFTW approach

[Figure: the S3P Framework combines Parallel Signal Processing
(Kepner/Hoffmann, Lincoln) with Self-Optimizing Software (Leiserson/Frigo,
MIT LCS). Algorithm stages 1..N are paired with processor mappings 1..M,
which are timed and verified to select the best mappings.]
S3P brings self-optimizing (FFTW) approach to
parallel signal processing systems
  • Framework exploits graph theory abstraction
  • Broadly applicable to system optimization
    problems
  • Defines clear component and system requirements

7
Outline
  • Introduction
  • Design
  • Demonstration
  • Results
  • Summary

8
System Requirements
Decomposable into Tasks (comp) and Conduits (comm)
Beamform: XOUT = w * XIN
Filter: XOUT = FIR(XIN)
Detect: XOUT = |XIN| > c
Mappable to different sets of hardware
Measurable resource usage of each mapping
  • Each compute stage can be mapped to different
    sets of hardware and timed

9
System Graph
Beamform
Filter
Detect
Edge is a conduit between a pair of task mappings
Node is a unique mapping of a task
  • System Graph can store the hardware resource
    usage of every possible Task & Conduit
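
One way to hold such a graph in memory is sketched below. The class and method names (`Node`, `SystemGraph`, `add_node`, `add_edge`, `nodes_for`) and all timings are hypothetical; the presentation does not show its actual data structure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    task: str  # e.g. "Beamform"
    cpus: int  # one candidate mapping: number of processors used


@dataclass
class SystemGraph:
    stages: list                                       # task names in pipeline order
    task_time: dict = field(default_factory=dict)      # Node -> seconds
    conduit_time: dict = field(default_factory=dict)   # (Node, Node) -> seconds

    def add_node(self, task, cpus, seconds):
        self.task_time[Node(task, cpus)] = seconds

    def add_edge(self, src, dst, seconds):
        self.conduit_time[(src, dst)] = seconds

    def nodes_for(self, task):
        return sorted((n for n in self.task_time if n.task == task),
                      key=lambda n: n.cpus)


g = SystemGraph(stages=["Beamform", "Filter", "Detect"])
for task in g.stages:
    for cpus in (1, 2):
        g.add_node(task, cpus, seconds=8.0 / cpus)   # made-up task timings
for a, b in zip(g.stages, g.stages[1:]):
    for src in g.nodes_for(a):
        for dst in g.nodes_for(b):
            g.add_edge(src, dst, seconds=1.0)        # made-up conduit timing
print(len(g.task_time), "nodes,", len(g.conduit_time), "edges")
```

Edges only connect mappings of consecutive tasks, so a path through the graph picks exactly one mapping per task.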

10
Path = System Mapping
Beamform
Filter
Detect
Best Path is the optimal system mapping
Each path is a complete system mapping
  • Graph construct is very general and widely used
    for optimization problems
  • Many efficient techniques for choosing best
    path (under constraints), such as Dynamic
    Programming

11
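Example: Maximize Throughput
[Figure: system graph for Beamform, Filter, and Detect. Each node stores the
task time for one mapping, and each edge stores the conduit time for a given
pair of mappings. Values shown in the figure: 1.5, 2.0, 3.0, 3.0, 4.0, 6.0,
6.0, 8.0, 16.0. The figure is also labeled "More Hardware".]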
  • Goal: Maximize throughput and minimize hardware
  • Choose path with the smallest bottleneck that
    satisfies hardware constraint
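
The selection rule above can be sketched as a brute-force search, which is enough for a three-stage example. The timing table, the `conduit_time` cost model, and the function name `best_bottleneck_path` are assumptions made for illustration, not values or names from the slides.

```python
from itertools import product

# Hypothetical timings: TASK_TIME[stage][cpus] = seconds per data set.
TASK_TIME = [
    {1: 16.0, 2: 8.0, 4: 4.0},  # Beamform
    {1: 6.0, 2: 3.0},           # Filter
    {1: 6.0, 2: 3.0},           # Detect
]


def conduit_time(src_cpus, dst_cpus):
    """Assumed cost model: moving data between unequal mappings costs more."""
    return 1.5 if src_cpus == dst_cpus else 2.0


def best_bottleneck_path(task_time, max_cpus):
    """Return ((bottleneck, cpus_used), path), or None if nothing fits."""
    best = None
    for path in product(*(sorted(stage) for stage in task_time)):
        if sum(path) > max_cpus:
            continue  # violates the hardware constraint
        times = [task_time[i][c] for i, c in enumerate(path)]
        times += [conduit_time(a, b) for a, b in zip(path, path[1:])]
        key = (max(times), sum(path))  # smallest bottleneck, then least hardware
        if best is None or key < best[0]:
            best = (key, path)
    return best


(bottleneck, cpus_used), path = best_bottleneck_path(TASK_TIME, max_cpus=6)
print(path, bottleneck, cpus_used)
```

In a pipeline the slowest task or conduit sets the rate, so throughput is roughly the reciprocal of the bottleneck time; ties are broken here in favor of less hardware.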

12
Path Finding Algorithms
  • Graph construct is very general
  • Widely used for optimization problems
  • Many efficient techniques for choosing best
    path (under constraints) such as Dijkstra's
    Algorithm and Dynamic Programming

Dynamic Programming

N = total hardware units
M = number of tasks
Pi = number of mappings for task i

t = M
pathTable[M][N] = all infinite weight paths
for( j = 1..M ) {
  for( k = 1..Pj ) {
    for( i = j+1..N-t+1 ) {
      if( i-size[k] > j ) {
        if( j > 1 ) {
          w = weight[pathTable[j-1][i-size[k]]] +
              weight[k] +
              weight[edge[last[pathTable[j-1][i-size[k]]], k]]
          p = addVertex[pathTable[j-1][i-size[k]], k]
        } else {
          w = weight[k]
          p = makePath[k]
        }
        if( weight[pathTable[j][i]] > w ) {
          pathTable[j][i] = p
        }
      }
    }
  }
  t = t - 1
}

Dijkstra's Algorithm

Initialize Graph G
Initialize source vertex s
Store all vertices of G in a minimum priority queue Q
while (Q is not empty)
  u = pop[Q]
  for (each vertex v, adjacent to u)
    w = u.totalPathWeight() + weight of edge <u,v> + v.weight()
    if (v.totalPathWeight() > w)
      v.totalPathWeight() = w
      v.predecessor() = u
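
A runnable variant of the dynamic programming idea is sketched below in Python. It is not a transcription of the slide's pseudocode: it minimizes total latency (node plus edge times) under a CPU budget, and it keeps the last stage's mapping in the table key so that conduit costs are counted exactly. The timing table, the `conduit_time` cost model, and the name `min_latency_mapping` are assumptions for illustration.

```python
import math

# Hypothetical timings: TASK_TIME[stage][cpus] = seconds.
TASK_TIME = [
    {1: 16.0, 2: 8.0, 4: 4.0},
    {1: 6.0, 2: 3.0},
    {1: 6.0, 2: 3.0},
]


def conduit_time(src_cpus, dst_cpus):
    """Assumed cost model for the edge between two consecutive mappings."""
    return 1.5 if src_cpus == dst_cpus else 2.0


def min_latency_mapping(task_time, conduit_time, max_cpus):
    """Return (latency, mapping) for the cheapest path within max_cpus CPUs."""
    # table: (cpus_used_so_far, cpus_of_last_stage) -> (latency, path)
    table = {(c, c): (t, (c,)) for c, t in task_time[0].items() if c <= max_cpus}
    for stage in task_time[1:]:
        nxt = {}
        for (used, last), (lat, path) in table.items():
            for c, t in stage.items():
                if used + c > max_cpus:
                    continue  # this extension would exceed the budget
                cand = lat + conduit_time(last, c) + t
                key = (used + c, c)
                if cand < nxt.get(key, (math.inf,))[0]:
                    nxt[key] = (cand, path + (c,))
        table = nxt
    return min(table.values(), default=(math.inf, ()))


print(min_latency_mapping(TASK_TIME, conduit_time, max_cpus=6))
```

The table grows with the CPU budget and the number of mappings per stage rather than with the number of complete paths, which is the benefit of dynamic programming over enumeration.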
13
S3P Inputs and Outputs
[Figure: the S3P Framework takes the Application, Algorithm Information,
System Constraints, and Hardware Information as inputs and produces the
Best System Mapping.]
  • Can flexibly add information about
      • Application
      • Algorithm
      • System
      • Hardware

14
Outline
  • Introduction
  • Design
  • Demonstration
  • Results
  • Summary

15
S3P Demonstration Testbed
Multi-Stage Application
Hardware (Workstation Cluster)
16
Multi-Stage Application
  • Features
      • Generic radar/sonar signal processing chain
      • Utilizes key kernels (FIR, matrix multiply, FFT
        and corner turn)
      • Scalable to any problem size (fully parameterized
        algorithm)
      • Self validates (built-in target generator)

17
Parallel Vector Library (PVL)
  • Simple mappable components support data, task and
    pipeline parallelism

18
Hardware Platform
  • Network of 8 Linux workstations
      • Dual 800 MHz Pentium III processors
  • Communication
      • Gigabit ethernet, 8-port switch
      • Isolated network
  • Software
      • Linux kernel release 2.2.14
      • GNU C Compiler
      • MPICH communication library over TCP/IP
  • Advantages
      • Software tools
      • Widely available
      • Inexpensive (high Mflops/$)
      • Excellent rapid prototyping platform
  • Disadvantages
      • Non real-time OS
      • Non real-time messaging
      • Slower interconnect
      • Difficult to model
      • SMP behavior erratic

19
S3P Engine
[Figure: the S3P Engine takes the Application Program, Algorithm
Information, System Constraints, and Hardware Information as inputs and
produces the Best System Mapping. Its parts are the MapGenerator, MapTimer,
and MapSelector.]
  • Map Generator constructs the system graph for all
    candidate mappings
  • Map Timer times each node and edge of the system
    graph
  • Map Selector searches the system graph for the
    optimal set of maps
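
The three roles can be sketched as three small functions. The function names, signatures, and the fake cost model in the usage section are hypothetical; a real timer would run each stage on the target hardware instead of calling a formula.

```python
from itertools import product


def map_generator(stages, cpu_choices, max_cpus):
    """Candidate system mappings: one CPU count per stage, within the budget."""
    return [m for m in product(cpu_choices, repeat=len(stages))
            if sum(m) <= max_cpus]


def map_timer(stages, mappings, measure_task, measure_conduit):
    """Measure each distinct node and edge once and cache the result."""
    node_t, edge_t = {}, {}
    for m in mappings:
        for i, cpus in enumerate(m):
            if (i, cpus) not in node_t:
                node_t[(i, cpus)] = measure_task(stages[i], cpus)
        for i, (a, b) in enumerate(zip(m, m[1:])):
            if (i, a, b) not in edge_t:
                edge_t[(i, a, b)] = measure_conduit(a, b)
    return node_t, edge_t


def map_selector(mappings, node_t, edge_t):
    """Pick the mapping with the smallest bottleneck, then the least hardware."""
    def bottleneck(m):
        nodes = [node_t[(i, c)] for i, c in enumerate(m)]
        edges = [edge_t[(i, a, b)] for i, (a, b) in enumerate(zip(m, m[1:]))]
        return max(nodes + edges)
    return min(mappings, key=lambda m: (bottleneck(m), sum(m)))


# Usage with a made-up cost model standing in for real measurements.
STAGES = ["Low Pass Filter", "Beamform", "Matched Filter"]
WORK = {"Low Pass Filter": 12.0, "Beamform": 6.0, "Matched Filter": 6.0}
mappings = map_generator(STAGES, cpu_choices=(1, 2, 4), max_cpus=6)
node_t, edge_t = map_timer(STAGES, mappings,
                           measure_task=lambda stage, cpus: WORK[stage] / cpus,
                           measure_conduit=lambda a, b: 1.0)
best = map_selector(mappings, node_t, edge_t)
print(best)
```

Separating timing from selection means the expensive measurements are taken once per node and edge, and different selection criteria can then be tried on the same timing data.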

20
Outline
  • Introduction
  • Design
  • Demonstration
  • Results
  • Summary

21
Optimal Throughput
  • Vary number of processors used on each stage
  • Time each computation stage and communication
    conduit
  • Find path with minimum bottleneck

[Figure: mappings with 1 CPU, 2 CPU, 3 CPU, and 4 CPU]
22
S3P Timings (4 cpu max)
  • Graphical depiction of timings (wider is better)

[Figure: timings per task for the 1, 2, 3, and 4 CPU mappings.]
23
S3P Timings (12 cpu max) (wider is better)
[Figure: timings for the Input, Low Pass Filter, Beamform, and Matched Filter
tasks under the 2, 4, 6, 8, and 12 CPU mappings.]
  • Large amount of data requires algorithm to find
    best path

24
Predicted and Achieved Latency (4-8 cpu max)
[Plots: latency (sec) versus maximum number of processors, one panel for the
large (48x128K) problem size and one for the small (48x4K) problem size.]
  • Find path that produces minimum latency for a
    given number of processors
  • Excellent agreement between S3P predicted and
    achieved latencies

25
Predicted and Achieved Throughput (4-8 cpu max)
[Plots: throughput (pulses/sec) versus maximum number of processors, one
panel for the large (48x128K) problem size and one for the small (48x4K)
problem size.]
  • Find path that produces maximum throughput for a
    given number of processors
  • Excellent agreement between S3P predicted and
    achieved throughput

26
SMP Results (16 cpu max)
[Plot: throughput (pulses/sec) versus maximum number of processors for the
large (48x128K) problem size.]
  • SMP overstresses Linux Real Time capabilities
  • Poor overall system performance
  • Divergence between predicted and measured

27
Simulated (128 cpu max)
[Plots: throughput (pulses/sec) and latency (sec) versus maximum number of
processors, both for the small (48x4K) problem size.]
  • Simulator allows exploration of larger systems

28
Reducing the Search Space: Algorithm Comparison
[Plot: number of timings required versus maximum number of processors.]
  • Graph algorithms provide baseline performance
  • Hill Climbing performance varies as a function of
    initialization and neighborhood definition
  • Preprocessor outperforms all other algorithms
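
The sensitivity of hill climbing to its starting point and neighborhood can be seen in a minimal sketch. The neighborhood used here (change one stage's CPU count by one step), the cost model, and the name `hill_climb` are assumptions; the presentation does not specify its hill climbing variant.

```python
def hill_climb(cost, start, choices, max_cpus):
    """Greedy local search over mappings; returns (mapping, timings_used)."""
    evaluated = {}

    def timed(m):
        if m not in evaluated:  # each mapping is "timed" at most once
            evaluated[m] = cost(m)
        return evaluated[m]

    def neighbors(m):
        # Assumed neighborhood: move one stage one step along `choices`.
        for i, c in enumerate(m):
            k = choices.index(c)
            for step in (-1, 1):
                if 0 <= k + step < len(choices):
                    n = m[:i] + (choices[k + step],) + m[i + 1:]
                    if sum(n) <= max_cpus:
                        yield n

    current = start
    while True:
        best = min(neighbors(current), key=timed, default=current)
        if timed(best) >= timed(current):
            return current, len(evaluated)  # no strictly better neighbor
        current = best


WORK = (12.0, 6.0, 6.0)  # made-up work per stage


def bottleneck(m):
    return max(w / c for w, c in zip(WORK, m))


print(hill_climb(bottleneck, (1, 1, 1), (1, 2, 4), max_cpus=8))
print(hill_climb(bottleneck, (4, 2, 2), (1, 2, 4), max_cpus=8))
```

Starting from (1, 1, 1) the search stalls on a plateau with bottleneck 6.0, while starting from (4, 2, 2) it stays at the mapping with bottleneck 3.0; both runs time only a handful of the candidate mappings.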
29
Future Work
  • Program area
      • Determine how to incorporate global optimization
        into other middleware efforts (e.g. PCA, HPEC-SI,
        ...)
  • Hardware area
      • Scale and demonstrate on larger/real-time system
      • HPCMO Mercury system at WPAFB
      • Expect even better results than on Linux cluster
      • Apply to parallel hardware
      • RAW
  • Algorithm area
      • Exploit ways of reducing search space
      • Provide solution families via sensitivity
        analysis

30
Outline
  • Introduction
  • Design
  • Demonstration
  • Results
  • Summary

31
Summary
  • System level constraints (latency, throughput,
    hardware size, ...) necessitate system level
    optimization
  • Application requirements for system level
    optimization are
      • Decomposable into components (input, filtering,
        output, ...)
      • Mappable to different configurations
        (processors, links, ...)
      • Measurable resource usage (time, memory, ...)
  • S3P demonstrates global optimization is feasible
    separate from the application

32
Acknowledgements
  • Matteo Frigo (MIT/LCS & Vanu, Inc.)
  • Charles Leiserson (MIT/LCS)
  • Adam Wierman (CMU)