Formal Loop Merging for Signal Transforms - PowerPoint PPT Presentation

About This Presentation
Title:

Formal Loop Merging for Signal Transforms

Description:

C compiler cannot do this. hardcoded special case. State-of-the-art ... New SPL Compiler. 18. S-SPL. Four central constructs: S, G, S, Perm (sum) makes loops ... – PowerPoint PPT presentation

Number of Views:35
Avg rating:3.0/5.0
Slides: 33
Provided by: jose271
Learn more at: http://users.ece.cmu.edu
Category:

less

Transcript and Presenter's Notes

Title: Formal Loop Merging for Signal Transforms


1
Formal Loop Merging for Signal Transforms
  • Franz Franchetti
  • Yevgen S. Voronenko
  • Markus Püschel
  • Department of Electrical Computer Engineering
  • Carnegie Mellon University
  • This work was supported by NSF through awards
    0234293, and 0325687. Franz Franchetti was
    supported by the Austrian Science Fund FWF,
    Erwin Schrödinger Fellowship J2322.

http//www.spiral.net
2
Problem
  • Runtime of (uniprocessor) numerical applications
    typically dominated by few compute-intensive
    kernels
  • Examples discrete Fourier transform,
    matrix-matrix multiplication
  • These kernels are hand-written for every
    architecture (open-source and commercial
    libraries)
  • Writing fast numerical code is becoming
    increasingly difficult, expensive, and platform
    dependent, due to
  • Complicated memory hierarchies
  • Special purpose instructions (short vector
    extensions, fused multiply-add)
  • Other microarchitectural features (deep
    pipelines, superscalar execution)

3
Example Discrete Fourier Transform (DFT)
Performance on Pentium 4 _at_ 3 GHz
vendor library
10x performance gap
Pseudo-Mflop/s (higher is better)
roughly the same operations count
GNU Scientific Library
log2(size)
Writing fast code is hard. Are there
alternatives?
4
Automatic Code Generation and Adaptation
  • ATLAS Code generator for basic linear algebra
    subroutines (BLAS)
  • Whaley, et. al., 1998 Yotov, et al., 2005
  • FFTW Adaptive library for computing the discrete
    Fourier transform (DFT) and its variants
  • Frigo and Johnson, 1998
  • SPIRAL Code generator for linear signal
    transforms (including DFT)
  • Püschel, et al., 2004
  • See also Proceedings of the IEEE special issue
    on Program Generation, Optimization, and
    Adaptation, Feb. 2005.
  • Focus of this talk
  • A new approach to automatic loop merging in
    SPIRAL

5
Talk Organization
  • SPIRAL Background
  • Automatic loop merging in SPIRAL
  • Experimental Results
  • Conclusions

6
Talk Organization
  • SPIRAL Background
  • Automatic loop merging in SPIRAL
  • Experimental Results
  • Conclusions

7
SPIRAL DSP Transforms
  • SPIRAL generates optimized code for linear signal
    transforms, such as discrete Fourier transform
    (DFT), discrete cosine transforms, FIR filters,
    wavelets, and many others.
  • Linear transform matrix-vector product
  • Example DFT of input vector x

output
transform matrix
input
8
SPIRAL Fast Transform Algorithms
  • Reduce computation cost from O(n2) to O(n log n)
  • For every transform there are many fast
    algorithms
  • Algorithm sparse matrix factorization
  • SPIRAL generates the space of algorithms using
    breakdown rules in the domain-specific Signal
    Processing Language (SPL)

12 adds 4 mults
4 adds
4 adds
1 mult
(when multiplied with input vector x)
9
SPL (Signal Processing Language)
  • SPL expresses transform algorithms as structured
    sparse matrix factorization
  • Examples
  • SPL grammar in Backus-Naur form

10
Compiling SPL to Code Using Templates
for i0..n-1 for j0..m-1
yinjxmij
y01n-1 call A(x01n-1) yn1nm-1
call B(xn1nm-1)
for i0..n-1 yim1imm-1 call
B(xim1imm-1)
for i0..n-1 yim1imm-1 call
B(xinim-1)
11
Some Transforms and Breakdown Rules in SPIRAL
Base case rules
Spiral contains 30 transforms and 100 rules
12
SPIRAL Architecture
Approach Empirical search over alternative
recursive algorithms
Transform
Formula Generator
... for (int j0 jlt3 j) y2j
xj xj4 y2j1 xj - xj4
...
... for (int j0 jlt3 j) yj
C1xj C2xj4 yj4 C1xj
C2xj4 ...
SPL Compiler
Search Engine
SPIRAL
C Compiler
2B 33 0F ...
0A FF C4 ...
Timer
80 ns
100 ns
Adapted Implementation
DFT_8.c 80 ns
13
Problem Fusing Permutations and Loops
Two passes over the working set Complex index
computation
void sub(double y, double x) double t8
for (int i0 ilt7 i) t(i/4)2(i4)
xi for (int i0 ilt4 i) y2i
t2i t2i1 y2i1 t2i -
t2i1
direct mapping
C compiler cannot do this
hardcoded special case
One pass over the working set Simple index
computation
State-of-the-art SPIRAL Hardcoded with
templates FFTW Hardcoded in the infrastructure
void sub(double y, double x) for (int j0
jlt3 j) y2j xj xj4
y2j1 xj - xj4
How does hardcoding scale?
14
General Loop Merging Problem
permutations
  • Combinatorial explosion Implementing templates
    for all rules and all recursive combinations is
    unfeasible
  • In many cases even theoretically not understood

15
Our Solution in SPIRAL
  • Loop merging at C code level impractical
  • Loop merging at SPL level not possible
  • Solution
  • New language ?-SPL an abstraction level between
    SPL and code
  • Loop merging through ?-SPL formula manipulation

16
Talk Organization
  • SPIRAL Background
  • Automatic loop merging in SPIRAL
  • Experimental Results
  • Conclusions

17
New Approach for Loop Merging
Transform
SPL formula
SPL To S-SPL
Formula Generator
S-SPL formula
SPL Compiler
New SPL Compiler
Search Engine
Loop Merging
S-SPL formula
C Compiler
Index Simplification
Timer
S-SPL formula
S-SPL Compiler
Code
Adapted Implementation
18
S-SPL
  • Four central constructs S, G, S, Perm
  • ? (sum) makes loops explicit
  • Gf (gather) reads data using the index mapping
    f
  • Sf (scatter) writes data using the index
    mapping f
  • Permf permutes data using the index mapping
    f
  • Every ?-SPL formula still represents a matrix
    factorization

Example
j0
j1
F2
j2
j3
Input
Output
19
Loop Merging With Rewriting Rules
SPL
SPL To S-SPL
S-SPL
Loop Merging
S-SPL
Index Simplification
S-SPL
Rules
S-SPL Compiler
for (int j0 jlt3 j) y2j xj
xj4 y2j1 xj - xj4
Code
20
Application Loop Merging For FFTs
DFT breakdown rules
Cooley-Tukey FFT
Prime factor FFT
Rader FFT
Index mapping functions are non-trivial
21
Example
DFTpq
Given DFTpq p prime p-1 rs
Prime factor
DFTp
VT
V
DFTq
Formula fragment
Rader
DFTp-1
DFTp-1
WT
W
D
Code for one memory access
p7 q4 r3 s2 tx((21((7k ((((((2j
i)/2) 3((2j i)2)) 1)) ? (5pow(3,
((((2j i)/2) 3((2j i)2)) 1)))7
(0)))/7) 8((7k ((((((2j i)/2) 3((2j
i)2)) 1)) ? (5pow(3, ((((2j i)/2)
3((2j i)2)) 1)))7 (0)))7))28)
Cooley-Tukey
DFTr
DFTs
L
D
Task Index simplification
22
Index Simplification Basic Idea
  • Example Identity necessary for fusing successive
    Rader and prime-factor step
  • Performed at the ?-SPL level through rewrite
    rules on function objects
  • Advantages
  • no analysis necessary
  • efficient (or doable at all)

23
Index Simplification Rules for FFTs
Cooley-Tukey
Transitional
Cooley-Tukey Prime factor
Cooley-Tukey Prime factor Rader
These 15 rules cover all combinations. Some
encode novel optimizations.
24
Loop Merging For the FFTs Example (contd)
SPL
SPL To S-SPL
S-SPL
Loop Merging
S-SPL
Index Simplification
// Input _Complex double x28, output y28
int p1, b1 for(int j1 0 j1 lt 3 j1)
y7j1 x(7j128) p1 1 b1
7j1 for(int j0 0 j0 lt 2 j0)
yb1 2j0 1 x(b1 4p1)28 x(b1
24p1)28 yb1 2j0 2 x(b1
4p1)28 - x(b1 24p1)28 p1
(p137)
S-SPL
S-SPL Compiler
Code
25
// Input _Complex double x28, output y28
int p1, b1 for(int j1 0 j1 lt 3 j1)
y7j1 x(7j128) p1 1 b1
7j1 for(int j0 0 j0 lt 2 j0)
yb1 2j0 1 x(b1 4p1)28
x(b1 24p1)28
yb1 2j0 2 x(b1 4p1)28
x(b1 24p1)28
p1 (p137)
// Input _Complex double x28, output
y28 double t128 for(int i5 0 i5 lt 27
i5) t1i5 x(73(i5/7)
42(i57))28 for(int i1 0 i1 lt 3 i1)
double t37, t47, t57 for(int i6
0 i6 lt 6 i6) t5i6 t17i1
i6 for(int i8 0 i8 lt 6 i8)
t4i8 t5i8 ? (5pow(3, i8))7 0
double t71, t81 t80
t40 t70 t80 t30
t70 double t106,
t116, t126 for(int i13 0 i13 lt
5 i13) t12i13 t4i13 1
for(int i14 0 i14 lt 5 i14)
t11i14 t12(i14/2) 3(i142)
for(int i3 0 i3 lt 2 i3)
double t142, t152 for(int i15
0 i15 lt 1 i15) t15i15
t112i3 i15 t140 (t150
t151) t141 (t150 - t151)
for(int i17 0 i17 lt 1 i17)
t102i3 i17 t14i17
for(int i19 0 i19 lt 5 i19)
t3i19 1 t10i19 for(int
i20 0 i20 lt 6 i20) y7i1 i20
t3i20
After, 2 Loops.
Before, 11 Loops.
26
Talk Organization
  • SPIRAL Background
  • Automatic loop merging in SPIRAL
  • Experimental Results
  • Conclusions

27
Benchmarks Setup
  • Comparison against FFTW 3.0.1
  • Pentium 4 3.6 GHz
  • We consider sizes requiring at least one Rader
    step(sizes with large prime factor)
  • We divide sizes into levels depending on number
    of Rader steps needed (Rader FFT has most
    expensive index mapping)

28
One Rader Step
Average SPIRAL speedup factor of 2.7
29
Two Rader Steps
Average SPIRAL speedup factor of 3.3
30
Three Rader Steps
Average SPIRAL speedup factor of 3.4
31
Talk Organization
  • SPIRAL Background
  • Automatic loop merging in SPIRAL
  • Experimental Results
  • Conclusions

32
Conclusion
  • General loop optimization framework for linear
    DSP transforms in SPIRAL
  • Loop optimization at the right abstraction
    level S-SPL
  • Application to FFT Speedups of a factor of 2-5
    over FFTW for sizes with large primes
  • Future work Other S-SPL optimizations
  • Loop merging for other transforms
  • Loop elimination, interchange, peeling

http//www.spiral.net
Write a Comment
User Comments (0)
About PowerShow.com