Formal Loop Merging for Signal Transforms - PowerPoint PPT Presentation

About This Presentation

Title:

Formal Loop Merging for Signal Transforms

Description:

C compiler cannot do this. hardcoded special case. State-of-the-art ... New SPL Compiler. 18. S-SPL. Four central constructs: S, G, S, Perm (sum) makes loops ... – PowerPoint PPT presentation

Number of Views:35

Avg rating:3.0/5.0

Slides: 33

Provided by: jose271

Learn more at: http://users.ece.cmu.edu

Category:

more less

Transcript and Presenter's Notes

Title: Formal Loop Merging for Signal Transforms

1
Formal Loop Merging for Signal Transforms

Franz Franchetti
Yevgen S. Voronenko
Markus Püschel
Department of Electrical Computer Engineering
Carnegie Mellon University
This work was supported by NSF through awards
0234293, and 0325687. Franz Franchetti was
supported by the Austrian Science Fund FWF,
Erwin Schrödinger Fellowship J2322.

http//www.spiral.net
2
Problem

Runtime of (uniprocessor) numerical applications
typically dominated by few compute-intensive
kernels
Examples discrete Fourier transform,
matrix-matrix multiplication
These kernels are hand-written for every
architecture (open-source and commercial
libraries)
Writing fast numerical code is becoming
increasingly difficult, expensive, and platform
dependent, due to
Complicated memory hierarchies
Special purpose instructions (short vector
extensions, fused multiply-add)
Other microarchitectural features (deep
pipelines, superscalar execution)

3
Example Discrete Fourier Transform (DFT)
Performance on Pentium 4 _at_ 3 GHz
vendor library
10x performance gap
Pseudo-Mflop/s (higher is better)
roughly the same operations count
GNU Scientific Library
log2(size)
Writing fast code is hard. Are there
alternatives?
4
Automatic Code Generation and Adaptation

ATLAS Code generator for basic linear algebra
subroutines (BLAS)
Whaley, et. al., 1998 Yotov, et al., 2005
FFTW Adaptive library for computing the discrete
Fourier transform (DFT) and its variants
Frigo and Johnson, 1998
SPIRAL Code generator for linear signal
transforms (including DFT)
Püschel, et al., 2004
See also Proceedings of the IEEE special issue
on Program Generation, Optimization, and
Adaptation, Feb. 2005.

Focus of this talk
A new approach to automatic loop merging in
SPIRAL

5
Talk Organization

SPIRAL Background
Automatic loop merging in SPIRAL
Experimental Results
Conclusions

6
Talk Organization

SPIRAL Background
Automatic loop merging in SPIRAL
Experimental Results
Conclusions

7
SPIRAL DSP Transforms

SPIRAL generates optimized code for linear signal
transforms, such as discrete Fourier transform
(DFT), discrete cosine transforms, FIR filters,
wavelets, and many others.
Linear transform matrix-vector product
Example DFT of input vector x

output
transform matrix
input
8
SPIRAL Fast Transform Algorithms

Reduce computation cost from O(n2) to O(n log n)
For every transform there are many fast
algorithms
Algorithm sparse matrix factorization
SPIRAL generates the space of algorithms using
breakdown rules in the domain-specific Signal
Processing Language (SPL)

12 adds 4 mults
4 adds
4 adds
1 mult
(when multiplied with input vector x)
9
SPL (Signal Processing Language)

SPL expresses transform algorithms as structured
sparse matrix factorization
Examples
SPL grammar in Backus-Naur form

10
Compiling SPL to Code Using Templates
for i0..n-1 for j0..m-1
yinjxmij
y01n-1 call A(x01n-1) yn1nm-1
call B(xn1nm-1)
for i0..n-1 yim1imm-1 call
B(xim1imm-1)
for i0..n-1 yim1imm-1 call
B(xinim-1)
11
Some Transforms and Breakdown Rules in SPIRAL
Base case rules
Spiral contains 30 transforms and 100 rules
12
SPIRAL Architecture
Approach Empirical search over alternative
recursive algorithms
Transform
Formula Generator
... for (int j0 jlt3 j) y2j
xj xj4 y2j1 xj - xj4
...
... for (int j0 jlt3 j) yj
C1xj C2xj4 yj4 C1xj
C2xj4 ...
SPL Compiler
Search Engine
SPIRAL
C Compiler
2B 33 0F ...
0A FF C4 ...
Timer
80 ns
100 ns
Adapted Implementation
DFT_8.c 80 ns
13
Problem Fusing Permutations and Loops
Two passes over the working set Complex index
computation
void sub(double y, double x) double t8
for (int i0 ilt7 i) t(i/4)2(i4)
xi for (int i0 ilt4 i) y2i
t2i t2i1 y2i1 t2i -
t2i1
direct mapping
C compiler cannot do this
hardcoded special case
One pass over the working set Simple index
computation
State-of-the-art SPIRAL Hardcoded with
templates FFTW Hardcoded in the infrastructure
void sub(double y, double x) for (int j0
jlt3 j) y2j xj xj4
y2j1 xj - xj4
How does hardcoding scale?
14
General Loop Merging Problem
permutations

Combinatorial explosion Implementing templates
for all rules and all recursive combinations is
unfeasible
In many cases even theoretically not understood

15
Our Solution in SPIRAL

Loop merging at C code level impractical
Loop merging at SPL level not possible
Solution
New language ?-SPL an abstraction level between
SPL and code
Loop merging through ?-SPL formula manipulation

16
Talk Organization

SPIRAL Background
Automatic loop merging in SPIRAL
Experimental Results
Conclusions

17
New Approach for Loop Merging
Transform
SPL formula
SPL To S-SPL
Formula Generator
S-SPL formula
SPL Compiler
New SPL Compiler
Search Engine
Loop Merging
S-SPL formula
C Compiler
Index Simplification
Timer
S-SPL formula
S-SPL Compiler
Code
Adapted Implementation
18
S-SPL

Four central constructs S, G, S, Perm
? (sum) makes loops explicit
Gf (gather) reads data using the index mapping
f
Sf (scatter) writes data using the index
mapping f
Permf permutes data using the index mapping
f
Every ?-SPL formula still represents a matrix
factorization

Example
j0
j1
F2
j2
j3
Input
Output
19
Loop Merging With Rewriting Rules
SPL
SPL To S-SPL
S-SPL
Loop Merging
S-SPL
Index Simplification
S-SPL
Rules
S-SPL Compiler
for (int j0 jlt3 j) y2j xj
xj4 y2j1 xj - xj4
Code
20
Application Loop Merging For FFTs
DFT breakdown rules
Cooley-Tukey FFT
Prime factor FFT
Rader FFT
Index mapping functions are non-trivial
21
Example
DFTpq
Given DFTpq p prime p-1 rs
Prime factor
DFTp
VT
V
DFTq
Formula fragment
Rader
DFTp-1
DFTp-1
WT
W
D
Code for one memory access
p7 q4 r3 s2 tx((21((7k ((((((2j
i)/2) 3((2j i)2)) 1)) ? (5pow(3,
((((2j i)/2) 3((2j i)2)) 1)))7
(0)))/7) 8((7k ((((((2j i)/2) 3((2j
i)2)) 1)) ? (5pow(3, ((((2j i)/2)
3((2j i)2)) 1)))7 (0)))7))28)
Cooley-Tukey
DFTr
DFTs
L
D
Task Index simplification
22
Index Simplification Basic Idea

Example Identity necessary for fusing successive
Rader and prime-factor step

Performed at the ?-SPL level through rewrite
rules on function objects

Advantages
no analysis necessary
efficient (or doable at all)

23
Index Simplification Rules for FFTs
Cooley-Tukey
Transitional
Cooley-Tukey Prime factor
Cooley-Tukey Prime factor Rader
These 15 rules cover all combinations. Some
encode novel optimizations.
24
Loop Merging For the FFTs Example (contd)
SPL
SPL To S-SPL
S-SPL
Loop Merging
S-SPL
Index Simplification
// Input _Complex double x28, output y28
int p1, b1 for(int j1 0 j1 lt 3 j1)
y7j1 x(7j128) p1 1 b1
7j1 for(int j0 0 j0 lt 2 j0)
yb1 2j0 1 x(b1 4p1)28 x(b1
24p1)28 yb1 2j0 2 x(b1
4p1)28 - x(b1 24p1)28 p1
(p137)
S-SPL
S-SPL Compiler
Code
25
// Input _Complex double x28, output y28
int p1, b1 for(int j1 0 j1 lt 3 j1)
y7j1 x(7j128) p1 1 b1
7j1 for(int j0 0 j0 lt 2 j0)
yb1 2j0 1 x(b1 4p1)28
x(b1 24p1)28
yb1 2j0 2 x(b1 4p1)28
x(b1 24p1)28
p1 (p137)
// Input _Complex double x28, output
y28 double t128 for(int i5 0 i5 lt 27
i5) t1i5 x(73(i5/7)
42(i57))28 for(int i1 0 i1 lt 3 i1)
double t37, t47, t57 for(int i6
0 i6 lt 6 i6) t5i6 t17i1
i6 for(int i8 0 i8 lt 6 i8)
t4i8 t5i8 ? (5pow(3, i8))7 0
double t71, t81 t80
t40 t70 t80 t30
t70 double t106,
t116, t126 for(int i13 0 i13 lt
5 i13) t12i13 t4i13 1
for(int i14 0 i14 lt 5 i14)
t11i14 t12(i14/2) 3(i142)
for(int i3 0 i3 lt 2 i3)
double t142, t152 for(int i15
0 i15 lt 1 i15) t15i15
t112i3 i15 t140 (t150
t151) t141 (t150 - t151)
for(int i17 0 i17 lt 1 i17)
t102i3 i17 t14i17
for(int i19 0 i19 lt 5 i19)
t3i19 1 t10i19 for(int
i20 0 i20 lt 6 i20) y7i1 i20
t3i20
After, 2 Loops.
Before, 11 Loops.
26
Talk Organization

SPIRAL Background
Automatic loop merging in SPIRAL
Experimental Results
Conclusions

27
Benchmarks Setup

Comparison against FFTW 3.0.1
Pentium 4 3.6 GHz
We consider sizes requiring at least one Rader
step(sizes with large prime factor)
We divide sizes into levels depending on number
of Rader steps needed (Rader FFT has most
expensive index mapping)

28
One Rader Step
Average SPIRAL speedup factor of 2.7
29
Two Rader Steps
Average SPIRAL speedup factor of 3.3
30
Three Rader Steps
Average SPIRAL speedup factor of 3.4
31
Talk Organization

SPIRAL Background
Automatic loop merging in SPIRAL
Experimental Results
Conclusions

32
Conclusion

General loop optimization framework for linear
DSP transforms in SPIRAL
Loop optimization at the right abstraction
level S-SPL
Application to FFT Speedups of a factor of 2-5
over FFTW for sizes with large primes
Future work Other S-SPL optimizations
Loop merging for other transforms
Loop elimination, interchange, peeling

http//www.spiral.net

Write a Comment

User Comments (0)