Title: High Performance Embedded Computing Software Initiative (HPEC-SI)


1
High Performance Embedded Computing Software
Initiative (HPEC-SI)

Dr. Jeremy Kepner MIT Lincoln Laboratory
  • This work is sponsored by the High Performance
    Computing Modernization Office under Air Force
    Contract F19628-00-C-0002. Opinions,
    interpretations, conclusions, and recommendations
    are those of the author and are not necessarily
    endorsed by the United States Government.

2
Outline
  • Introduction
  • Software Standards
  • Parallel VSIPL++
  • Future Challenges
  • Summary

3
Overview - High Performance Embedded Computing
(HPEC) Initiative
Common Imagery Processor (CIP)
Embedded multi-processor
Shared memory server
ASARS-2
Challenge: Transition advanced software
technology and practices into major defense
acquisition programs
4
Why Is DoD Concerned with Embedded Software?
Source: HPEC Market Study, March 2001
Estimated DoD expenditures for embedded signal
and image processing hardware and software ($B)
  • COTS acquisition practices have shifted the
    burden from point design hardware to point
    design software
  • Software costs for embedded systems could be
    reduced by one-third with improved programming
    models, methodologies, and standards

5
Issues with Current HPEC Development: Inadequacy
of Software Practices & Standards
  • High Performance Embedded Computing pervasive
    through DoD applications
  • Airborne Radar Insertion program
    • 85% software rewrite for each hardware platform
  • Missile common processor
    • Processor board costs < $100k
    • Software development costs > $100M
  • Torpedo upgrade
    • Two software re-writes required after changes in
      hardware design
  • Today, embedded software is:
    • Not portable
    • Not scalable
    • Difficult to develop
    • Expensive to maintain

6
Evolution of Software Support Towards "Write
Once, Run Anywhere/Any Size"
[Figure: timeline from 1990 onward comparing DoD software development with COTS development; application software layered on vendor software]
  • Application software has traditionally been tied
    to the hardware
7
Overall Initiative Goals & Impact
  • Program Goals
  • Develop and integrate software technologies for
    embedded parallel systems to address portability,
    productivity, and performance
  • Engage acquisition community to promote
    technology insertion
  • Deliver quantifiable benefits

  • Portability: reduction in lines-of-code to
    change/port/scale to a new system
  • Productivity: reduction in overall lines-of-code
  • Performance: computation and communication
    benchmarks
8
HPEC-SI Path to Success
  • HPEC Software Initiative builds on
  • Proven technology
  • Business models
  • Better software practices

9
Organization
  • Partnership with ODUSD(S&T), Government Labs,
    FFRDCs, Universities, Contractors, Vendors and
    DoD programs
  • Over 100 participants from over 20 organizations

10
Outline
  • Introduction
  • Software Standards
  • Parallel VSIPL++
  • Future Challenges
  • Summary

11
Emergence of Component Standards
Parallel Embedded Processor
  • Data Communication: MPI, MPI/RT, DRI
  • Control Communication: CORBA, HP-CORBA
  • Computation: VSIPL, VSIPL++
[Figure: processors P0-P3 connected to consoles and other computers]
Definitions:
  • VSIPL = Vector, Signal, and Image Processing Library
  • VSIPL++ = Parallel Object Oriented VSIPL
  • MPI = Message-Passing Interface
  • MPI/RT = MPI Real-Time
  • DRI = Data Reorganization Interface
  • CORBA = Common Object Request Broker Architecture
  • HP-CORBA = High Performance CORBA
  • HPEC Initiative - Builds on completed research
    and existing standards and libraries

12
The Path to Parallel VSIPL++
(world's first parallel object oriented standard)
  • First demo successfully completed
  • VSIPL++ v0.5 spec completed
  • VSIPL++ v0.1 code available
  • Parallel VSIPL++ spec in progress
  • High performance C++ demonstrated

[Roadmap figure: functionality grows over time from VSIPL + MPI to VSIPL++ to Parallel VSIPL++]
  • Phase 1: demonstration of existing standards;
    development of object-oriented standards; applied
    research on a unified computation/communication library
  • Phase 2: demonstration of object-oriented standards;
    development of the unified computation/communication
    library (Parallel VSIPL++ prototype); applied research
    on fault tolerance
  • Phase 3: development of fault tolerance (prototype);
    applied research on self-optimization
Milestone annotations:
  • VSIPL + MPI
    • Demonstrate insertions into fielded systems (e.g., CIP)
    • Demonstrate 3x portability
  • VSIPL++
    • High-level code abstraction
    • Reduce code size 3x
  • Parallel VSIPL++
    • Unified embedded computation/communication standard
    • Demonstrate scalability
13
Working Group Technical Scope
Development: VSIPL++
  • MAPPING (data parallelism)
  • Early binding (computations)
  • Compatibility (backward/forward)
  • Local knowledge (accessing local data)
  • Extensibility (adding new functions)
  • Remote procedure calls (CORBA)
  • C++ compiler support
  • Test suite
  • Adoption incentives (vendor, integrator)
Applied Research: Parallel VSIPL++
  • MAPPING (task/pipeline parallel)
  • Reconfiguration (for fault tolerance)
  • Threads
  • Reliability/availability
  • Data permutation (DRI functionality)
  • Tools (profiles, timers, ...)
  • Quality of service
14
Overall Technical Tasks and Schedule
[Schedule chart: near-, mid-, and long-term tasks spanning FY01-FY08]
  • VSIPL (Vector, Signal, and Image Processing Library)
  • MPI (Message Passing Interface)
  • VSIPL++ (Object Oriented): v0.1 spec, v0.1 code,
    v0.5 spec & code, v1.0 spec & code
  • Parallel VSIPL++: v0.1 spec, v0.1 code, v0.5 spec
    & code, v1.0 spec & code
  • Fault Tolerance / Self-Optimizing Software
  • Demonstrations: CIP Demo 2, followed by Demos 3-6,
    each task progressing from applied research through
    development to demonstration
15
HPEC-SI Goals: 1st Demo Achievements
  • Portability: goal 3x, achieved 10x
    (zero code changes required)
  • Productivity: goal 3x, achieved 6x
    (DRI code 6x smaller vs. MPI, est.)
  • Performance: goal 1.5x, achieved 2x
    (3x reduced cost or form factor)
[Figure: HPEC Software Initiative cycle of Develop, Prototype, and Demonstrate built on interoperable, scalable, object-oriented open standards]
16
Outline
  • Introduction
  • Software Standards
  • Parallel VSIPL++
  • Future Challenges
  • Summary

17
Parallel Pipeline
Signal Processing Algorithm
  • Filter:   XOUT = FIR(XIN)
  • Beamform: XOUT = w * XIN
  • Detect:   XOUT = XIN > c
  • Data Parallel within stages
  • Task/Pipeline Parallel across stages

18
Types of Parallelism
[Figure: data parallel decomposition]
19
Current Approach to Parallel Code
[Figure: algorithm mapping: Stage 1 on Proc 1 and Proc 2; Stage 2 on Proc 3 and Proc 4]

  // Mapping hard-coded into the application (4 processors)
  while (!done) {
    if ( rank()==1 || rank()==2 )
      stage1();
    else if ( rank()==3 || rank()==4 )
      stage2();
  }

  // Moving Stage 2 onto more processors requires editing the source
  while (!done) {
    if ( rank()==1 || rank()==2 )
      stage1();
    else if ( rank()==3 || rank()==4 || rank()==5 || rank()==6 )
      stage2();
  }
  • Algorithm and hardware mapping are linked
  • Resulting code is non-scalable and non-portable

20
Scalable Approach
Single Processor Mapping
  #include <Vector.h>
  #include <AddPvl.h>

  void addVectors(aMap, bMap, cMap) {
    Vector< Complex<Float> > a("a", aMap, LENGTH);
    Vector< Complex<Float> > b("b", bMap, LENGTH);
    Vector< Complex<Float> > c("c", cMap, LENGTH);
    b = 1;
    c = 2;
    a = b + c;
  }
Multi Processor Mapping
  • Single processor and multi-processor code are the
    same
  • Maps can be changed without changing software
  • High level code is compact

Lincoln Parallel Vector Library (PVL)
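The slides do not show how the maps themselves are defined. The following is a minimal sketch, assuming a hypothetical PVL-style Map class consistent with the code above (not a confirmed PVL API), of how the same addVectors routine could be driven by either a single-processor or a multi-processor map:

  // Illustration only: Map, Vector, and addVectors follow the PVL-style
  // interface sketched above; this is not a confirmed PVL API.
  #include <Vector.h>
  #include <AddPvl.h>

  int main() {
    Map singleMap({0});          // single-processor map: all data on node 0
    Map multiMap({0, 1, 2, 3});  // multi-processor map: data spread over nodes 0-3

    // Identical application code runs under either mapping;
    // only the map objects change.
    addVectors(singleMap, singleMap, singleMap);
    addVectors(multiMap, multiMap, multiMap);
    return 0;
  }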
21
C++ Expression Templates and PETE

Expression: A = B + C*D

Expression template type (compile-time parse tree):
  BinaryNode<OpAssign, Vector,
    BinaryNode<OpAdd, Vector,
      BinaryNode<OpMultiply, Vector, Vector> > >

[Figure: flow of B and C between Main, operator+, and operator=]
  1. Pass B and C references to operator +
  2. Create expression parse tree
  3. Return expression parse tree
  4. Pass expression tree reference to operator =
  5. Calculate result and perform assignment
Parse trees, not vectors, are created and copied.
  • Expression Templates enhance performance by
    allowing temporary variables to be avoided
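To make the mechanism concrete, here is a minimal, self-contained expression-template sketch in the spirit of PETE (the Portable Expression Template Engine). The Vec, Expr, AddExpr, and MulExpr names are illustrative only and are not PETE, PVL, or VSIPL++ types:

  #include <cstddef>
  #include <vector>
  #include <iostream>

  // Illustrative toy expression-template library (not PETE/VSIPL++ code).
  // CRTP base so the operators only match expression types.
  template <typename E>
  struct Expr {
      const E& self() const { return static_cast<const E&>(*this); }
  };

  struct Vec : Expr<Vec> {
      std::vector<double> data;
      explicit Vec(std::size_t n, double v = 0.0) : data(n, v) {}
      double operator[](std::size_t i) const { return data[i]; }
      std::size_t size() const { return data.size(); }

      // Assignment from any expression: the single loop where work happens.
      template <typename E>
      Vec& operator=(const Expr<E>& e) {
          for (std::size_t i = 0; i < size(); ++i) data[i] = e.self()[i];
          return *this;
      }
  };

  // Parse-tree node for lhs + rhs; evaluation deferred until element access.
  template <typename L, typename R>
  struct AddExpr : Expr<AddExpr<L, R>> {
      const L& lhs; const R& rhs;
      AddExpr(const L& l, const R& r) : lhs(l), rhs(r) {}
      double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
  };

  // Parse-tree node for element-wise lhs * rhs.
  template <typename L, typename R>
  struct MulExpr : Expr<MulExpr<L, R>> {
      const L& lhs; const R& rhs;
      MulExpr(const L& l, const R& r) : lhs(l), rhs(r) {}
      double operator[](std::size_t i) const { return lhs[i] * rhs[i]; }
  };

  // operator+ / operator* build tree nodes instead of computing temporaries.
  template <typename L, typename R>
  AddExpr<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
      return AddExpr<L, R>(l.self(), r.self());
  }
  template <typename L, typename R>
  MulExpr<L, R> operator*(const Expr<L>& l, const Expr<R>& r) {
      return MulExpr<L, R>(l.self(), r.self());
  }

  int main() {
      Vec A(4), B(4, 1.0), C(4, 2.0), D(4, 3.0);
      // A = B + C*D builds AddExpr<Vec, MulExpr<Vec, Vec>> at compile time and
      // evaluates it in one pass inside Vec::operator=, with no temporary vectors.
      A = B + C * D;
      for (std::size_t i = 0; i < A.size(); ++i) std::cout << A[i] << ' ';
      std::cout << '\n';  // prints: 7 7 7 7
  }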

22
PETE Linux Cluster Experiments
Expressions benchmarked: A=B+C, A=B+C*D, A=B+C*D/E+fft(F)
[Plots: relative execution time vs. vector length (8 to 131072) for each expression]
  • PVL with VSIPL has a small overhead
  • PVL with PETE can surpass VSIPL

23
PowerPC AltiVec Experiments
  • Results
  • Hand coded loop achieves good performance, but is
    problem specific and low level
  • Optimized VSIPL performs well for simple
    expressions, worse for more complex expressions
  • PETE style array operators perform almost as well
    as the hand-coded loop and are general, can be
    composed, and are high-level

Expressions benchmarked: A=B+C, A=B+C*D+E*F, A=B+C*D, A=B+C*D+E/F

Software technologies compared:
  • AltiVec loop (C)
    • For loop
    • Direct use of AltiVec extensions
    • Assumes unit stride
    • Assumes vector alignment
  • VSIPL, vendor optimized (C)
    • AltiVec-aware VSIPro Core Lite
      (www.mpi-softtech.com)
    • No multiply-add
    • Cannot assume unit stride
    • Cannot assume vector alignment
  • PETE with AltiVec (C++)
    • PETE operators
    • Indirect use of AltiVec extensions
    • Assumes unit stride
    • Assumes vector alignment

24
Outline
  • Introduction
  • Software Standards
  • Parallel VSIPL
  • Future Challenges
  • Summary

25
A = sin(A) + 2 * B
  • Generated code (no temporaries):
    for (index i = 0; i < A.size(); ++i)
      A.put(i, sin(A.get(i)) + 2 * B.get(i));
  • Apply inlining to transform to:
    for (index i = 0; i < A.size(); ++i)
      A.block[i] = sin(A.block[i]) + 2 * B.block[i];
  • Apply more inlining to transform to:
    T* Bp = &(B.block[0]);
    T* Aend = &(A.block[A.size()]);
    for (T* Ap = &(A.block[0]); Ap < Aend; ++Ap, ++Bp)
      *Ap = fmadd(2, *Bp, sin(*Ap));
  • Or apply PowerPC AltiVec extensions
  • Each step can be automatically generated
  • Optimization level: whatever the vendor desires
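As a self-contained check of the fully inlined form, the following sketch applies the same pointer loop to plain arrays; the scalar fmadd helper is an assumption standing in for a hardware fused multiply-add, not a VSIPL++ facility:

  #include <cmath>
  #include <cstddef>
  #include <iostream>

  // Illustrative helper: fused multiply-add semantics, a*b + c.
  // (Assumed for this sketch; real code would map to the hardware fmadd.)
  static inline double fmadd(double a, double b, double c) { return a * b + c; }

  int main() {
      const std::size_t n = 8;
      double A[n], B[n];
      for (std::size_t i = 0; i < n; ++i) { A[i] = 0.1 * i; B[i] = 1.0; }

      // Fully inlined form of A = sin(A) + 2*B: pointer loop, no temporaries.
      double* Bp   = &B[0];
      double* Aend = &A[n];
      for (double* Ap = &A[0]; Ap < Aend; ++Ap, ++Bp)
          *Ap = fmadd(2.0, *Bp, std::sin(*Ap));

      for (std::size_t i = 0; i < n; ++i) std::cout << A[i] << ' ';
      std::cout << '\n';
  }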

26
BLAS zherk Routine
  • BLAS = Basic Linear Algebra Subprograms
  • Hermitian matrix M: conjug(M)^t = M
  • zherk performs a rank-k update of Hermitian
    matrix C:
    C <- alpha * A * conjug(A)^t + beta * C
  • VSIPL code:
    A = vsip_cmcreate_d(10,15,VSIP_ROW,MEM_NONE);
    C = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
    tmp = vsip_cmcreate_d(10,10,VSIP_ROW,MEM_NONE);
    vsip_cmprodh_d(A,A,tmp);        /* A*conjug(A)t */
    vsip_rscmmul_d(alpha,tmp,tmp);  /* alpha*A*conjug(A)t */
    vsip_rscmmul_d(beta,C,C);       /* beta*C */
    vsip_cmadd_d(tmp,C,C);          /* alpha*A*conjug(A)t + beta*C */
    vsip_cblockdestroy(vsip_cmdestroy_d(tmp));
    vsip_cblockdestroy(vsip_cmdestroy_d(C));
    vsip_cblockdestroy(vsip_cmdestroy_d(A));
  • VSIPL++ code (also parallel):
    Matrix<complex<double> > A(10,15);
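For reference, the update both fragments implement is C <- alpha*A*conjug(A)^t + beta*C. A minimal plain-C++ illustration of that math (not VSIPL, VSIPL++, or an optimized BLAS) is:

  #include <complex>
  #include <cstddef>
  #include <vector>
  #include <iostream>

  using cd = std::complex<double>;

  // Plain-C++ illustration of the zherk update C <- alpha*A*conj(A)^T + beta*C,
  // for an m x k matrix A and an m x m Hermitian matrix C (row-major storage).
  void zherk_ref(double alpha, const std::vector<cd>& A,
                 double beta, std::vector<cd>& C,
                 std::size_t m, std::size_t k) {
      for (std::size_t i = 0; i < m; ++i)
          for (std::size_t j = 0; j < m; ++j) {
              cd sum(0.0, 0.0);
              for (std::size_t p = 0; p < k; ++p)
                  sum += A[i*k + p] * std::conj(A[j*k + p]);
              C[i*m + j] = alpha * sum + beta * C[i*m + j];
          }
  }

  int main() {
      const std::size_t m = 10, k = 15;     // same shapes as the slide example
      std::vector<cd> A(m*k, cd(1.0, 0.5)), C(m*m, cd(0.0, 0.0));
      zherk_ref(2.0, A, 0.5, C, m, k);
      std::cout << "C(0,0) = " << C[0] << '\n';
  }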

27
Simple Filtering Application
  int main ()
  {
    using namespace vsip;
    const length ROWS = 64;
    const length COLS = 4096;
    vsipl v;
    FFT<Matrix, complex<double>, complex<double>,
        FORWARD, 0, MULTIPLE, alg_hint()>
      forward_fft (Domain<2>(ROWS,COLS), 1.0);
    FFT<Matrix, complex<double>, complex<double>,
        INVERSE, 0, MULTIPLE, alg_hint()>
      inverse_fft (Domain<2>(ROWS,COLS), 1.0);
    const Matrix<complex<double> > weights
      (load_weights (ROWS, COLS));
    try
    {
      while (1)
        output (inverse_fft (forward_fft (input ()) * weights));
    }
    catch (std::runtime_error)
    {
      // Successfully caught access outside domain.
    }
  }

28
Explicit Parallel Filter
  #include <vsiplpp.h>
  using namespace VSIPL;
  const int ROWS = 64;
  const int COLS = 4096;
  int main (int argc, char* argv[])
  {
    Matrix<Complex<Float>> W (ROWS, COLS, "WMap");  // weights matrix
    Matrix<Complex<Float>> X (ROWS, COLS, "WMap");  // input matrix
    Matrix<Complex<Float>> Y (ROWS, COLS, "WMap");  // output matrix (declaration assumed; not shown on the slide)
    load_weights (W);
    try
    {
      while (1)
      {
        input (X);                      // some input function
        Y = IFFT (mul (FFT(X), W));
        output (Y);                     // some output function
      }
    }
    catch (Exception& e) { cerr << e << endl; }
  }

29
Multi-Stage Filter (main)
  using namespace vsip;
  const length ROWS = 64;
  const length COLS = 4096;
  int main (int argc, char* argv[])
  {
    sample_low_pass_filter<complex<float> > LPF;
    sample_beamform<complex<float> > BF;
    sample_matched_filter<complex<float> > MF;
    try
    {
      while (1) output (MF(BF(LPF(input ()))));
    }
    catch (std::runtime_error)
    {
      // Successfully caught access outside domain.
    }
  }

30
Multi-Stage Filter (low pass filter)
  template<typename T>
  class sample_low_pass_filter
  {
  public:
    sample_low_pass_filter()
      : FIR1_(load_w1 (W1_LENGTH), FIR1_LENGTH),
        FIR2_(load_w2 (W2_LENGTH), FIR2_LENGTH)
    {}
    Matrix<T> operator () (const Matrix<T>& Input)
    {
      Matrix<T> output(ROWS, COLS);
      for (index row = 0; row < ROWS; ++row)
        output.row(row) =
          FIR2_(FIR1_(Input.row(row)).second).second;
      return output;
    }
  private:
    FIR<T, SYMMETRIC_ODD, FIR1_DECIMATION,
        CONTINUOUS, alg_hint()> FIR1_;
    FIR<T, SYMMETRIC_ODD, FIR2_DECIMATION,
        CONTINUOUS, alg_hint()> FIR2_;
  };

31
Multi-Stage Filter (beam former)
  template<typename T>
  class sample_beamform
  {
  public:
    sample_beamform() : W3_(load_w3 (ROWS,COLS)) {}
    Matrix<T> operator () (const Matrix<T>& Input) const
    {
      return W3_ * Input;
    }
  private:
    const Matrix<T> W3_;
  };

32
Multi-Stage Filter (matched filter)
  template<typename T>
  class sample_matched_filter
  {
  public:
    sample_matched_filter()
      : W4_(load_w4 (ROWS,COLS)),
        forward_fft_ (Domain<2>(ROWS,COLS), 1.0),
        inverse_fft_ (Domain<2>(ROWS,COLS), 1.0)
    {}
    Matrix<T> operator () (const Matrix<T>& Input) const
    {
      return inverse_fft_ (forward_fft_ (Input) * W4_);
    }
  private:
    const Matrix<T> W4_;
    FFT<Matrix<T>, complex<double>, complex<double>,
        FORWARD, 0, MULTIPLE, alg_hint()> forward_fft_;
    FFT<Matrix<T>, complex<double>, complex<double>,
        INVERSE, 0, MULTIPLE, alg_hint()> inverse_fft_;
  };

33
Outline
  • Introduction
  • Software Standards
  • Parallel VSIPL++
  • Future Challenges
  • Summary

34
Dynamic Mapping for Fault Tolerance
[Figure: parallel processor with an input task mapped by Map0 to nodes 0,1 (XIN) and an output task mapped by Map1 to nodes 0,2 (XOUT)]
  • Switching processors is accomplished by switching
    maps
  • No change to algorithm required
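A minimal sketch of the idea, reusing the hypothetical PVL-style Map and Vector names from the earlier examples (not a confirmed PVL API): when a node fails, the task is simply handed a spare map, and the algorithm code is untouched.

  // Illustration only: Map and Vector follow the PVL-style interface
  // sketched earlier; this is not a confirmed PVL API.
  #include <Vector.h>

  void outputTask(const Map& map) {
    // The algorithm depends only on the map it is handed.
    Vector< Complex<Float> > xout("XOUT", map, LENGTH);
    // ... produce XOUT ...
  }

  int main() {
    Map map0({0, 1});   // primary mapping: nodes 0 and 1
    Map map1({0, 2});   // spare mapping: nodes 0 and 2

    bool node1Failed = false;
    while (true) {
      // Switching processors is accomplished by switching maps.
      outputTask(node1Failed ? map1 : map0);
      // ... a fault monitor would set node1Failed on failure ...
    }
  }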

35
Dynamic Mapping Performance Results
[Plot: relative time vs. data size for dynamic mapping]
  • Good dynamic mapping performance is possible

36
Optimal Mapping of Complex Algorithms
[Figure: the same application requires different optimal maps on different hardware: workstation, Intel cluster, PowerPC cluster, embedded board, and embedded multi-computer]
  • Need to automate process of mapping algorithm to
    hardware

37
Self-optimizing Software for Signal Processing
[Plots: latency (seconds) and throughput (frames/sec) vs. number of CPUs (4-8) for small (48x4K) and large (48x128K) problem sizes, annotated with the stage mappings selected, e.g. 1-1-1-1 through 1-3-2-2]
  • Find
    • Min(latency | #CPU)
    • Max(throughput | #CPU)
  • S3P selects the correct optimal mapping
  • Excellent agreement between S3P-predicted and
    achieved latencies and throughputs
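The slides do not show how S3P searches the space of mappings. The following is a small self-contained sketch of the selection step only: given candidate mappings with measured (or predicted) latency and throughput, pick Min(latency | #CPU) and Max(throughput | #CPU). The Mapping struct and the numbers are illustrative assumptions, not S3P code or data from the plots.

  #include <string>
  #include <vector>
  #include <iostream>

  // Hypothetical candidate mapping: CPUs assigned to each pipeline stage,
  // plus measured (or predicted) latency and throughput. These stand in for
  // S3P's real benchmarking data; they are not S3P's API.
  struct Mapping {
      std::string stages;      // e.g. "1-2-2-1": CPUs per stage
      int    cpus;             // total CPUs used
      double latency_s;        // seconds per frame
      double throughput_fps;   // frames per second
  };

  // Pick the mapping that minimizes latency subject to a CPU budget.
  const Mapping* min_latency(const std::vector<Mapping>& cands, int max_cpus) {
      const Mapping* best = nullptr;
      for (const Mapping& m : cands)
          if (m.cpus <= max_cpus && (!best || m.latency_s < best->latency_s))
              best = &m;
      return best;
  }

  // Pick the mapping that maximizes throughput subject to a CPU budget.
  const Mapping* max_throughput(const std::vector<Mapping>& cands, int max_cpus) {
      const Mapping* best = nullptr;
      for (const Mapping& m : cands)
          if (m.cpus <= max_cpus && (!best || m.throughput_fps > best->throughput_fps))
              best = &m;
      return best;
  }

  int main() {
      // Illustrative numbers only, not measurements from the slide.
      std::vector<Mapping> cands = {
          {"1-1-1-1", 4, 1.50, 2.0},
          {"1-1-2-1", 5, 1.20, 2.6},
          {"1-2-2-1", 6, 0.95, 3.1},
          {"1-2-2-2", 7, 0.80, 3.8},
          {"1-3-2-2", 8, 0.70, 4.4},
      };
      const int budget = 6;
      if (const Mapping* m = min_latency(cands, budget))
          std::cout << "Min latency with " << budget << " CPUs: " << m->stages
                    << " (" << m->latency_s << " s)\n";
      if (const Mapping* m = max_throughput(cands, budget))
          std::cout << "Max throughput with " << budget << " CPUs: " << m->stages
                    << " (" << m->throughput_fps << " frames/s)\n";
  }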
38
High Level Languages
High Performance Matlab Applications
  • Parallel Matlab need has been identified
  • HPCMO (OSU)
  • Required user interface has been demonstrated
  • Matlab*P (MIT/LCS)
  • PVL (MIT/LL)
  • Required hardware interface has been demonstrated
  • MatlabMPI (MIT/LL)

[Figure: layered stack: DoD sensor processing, DoD mission planning, scientific simulation, and commercial applications sit on a user interface, the Parallel Matlab Toolbox, a hardware interface, and parallel computing hardware]
  • Parallel Matlab Toolbox can now be realized

39
MatlabMPI deployment (speedup)
  • Maui
  • Image filtering benchmark (300x on 304 cpus)
  • Lincoln
  • Signal Processing (7.8x on 8 cpus)
  • Radar simulations (7.5x on 8 cpus)
  • Hyperspectral (2.9x on 3 cpus)
  • MIT
  • LCS Beowulf (11x Gflops on 9 duals)
  • AI Lab face recognition (10x on 8 duals)
  • Other
  • Ohio St. EM Simulations
  • ARL SAR Image Enhancement
  • Wash U Hearing Aid Simulations
  • So. Ill. Benchmarking
  • JHU Digital Beamforming
  • ISL Radar simulation
  • URI Heart modeling

[Plot: image filtering performance (Gigaflops) vs. number of processors on the IBM SP at the Maui computing center]
  • Rapidly growing MatlabMPI user base demonstrates
    need for parallel Matlab
  • Demonstrated scaling to 300 processors

40
Summary
  • HPEC-SI Expected benefit
  • Open software libraries, programming models, and
    standards that provide portability (3x),
    productivity (3x), and performance (1.5x)
    benefits to multiple DoD programs
  • Invitation to Participate
  • DoD Program offices with Signal/Image Processing
    needs
  • Academic and Government Researchers interested in
    high performance embedded computing
  • Contact: kepner@ll.mit.edu

41
The Links
High Performance Embedded Computing Workshop: http://www.ll.mit.edu/HPEC
High Performance Embedded Computing Software Initiative: http://www.hpec-si.org/
Vector, Signal, and Image Processing Library: http://www.vsipl.org/
MPI Software Technologies, Inc.: http://www.mpi-softtech.com/
Data Reorganization Initiative: http://www.data-re.org/
CodeSourcery, LLC: http://www.codesourcery.com/
MatlabMPI: http://www.ll.mit.edu/MatlabMPI