Title: A Language for the Compact Representation of Multiple Program Versions
1A Language for the Compact Representation of
Multiple Program Versions
Sébastien Donadio (1,2), James Brodman (3), Thomas Roeder (4), Kamen Yotov (4), Denis Barthou (2), Albert Cohen (5), María Jesús Garzarán (3), David Padua (3), and Keshav Pingali (4)
(1) BULL S.A. (2) University of Versailles (3) University of Illinois at Urbana-Champaign (4) Cornell University (5) INRIA Futurs
International Workshop LCPC 2005
2Outline
- Context in optimization for high performance
- Goals of this language
- Features of this language
- Examples (Daxpy, Dgemm)
- Conclusion
3Context
- Complex architecture and fragile optimizations
- Unpredictable performance
- Architecture, domain-specific optimizations
- Resort to empirical search
- Complement general-purpose optimizations with
user-driven ones
4Example: FFT performance
[Figure: FFT performance of the best available implementations (FFTW, Intel IPP, Spiral) compared with reasonable implementations (Numerical Recipes, GNU Scientific Library)]
5Goals of X-Language
- Tool to help programmers generate and evaluate multiple versions of their programs
- Applying control and data structure transformations
- Trying multiple transformation sequences and parameters
- Evaluating the performance of each version and deciding which transformation variants to try
6Goals of X-Language (cont.)
- The code must be portable across ISO C compilers
- Use pragma annotations for the above tasks
- Observable program semantics are not altered by the interpretation of these pragmas (assuming transformation legality)
7Comparison with related work
8Features of the language
- Elementary transformations (fission, strip mining, interchange, unrolling, ...)
- Composition of transformations
- Conditional transformations (versioning)
- Procedural abstraction of transformations
- A mechanism to define new transformations
- No validity check is performed on the transformations
9General schema of X-Language
[Diagram with boxes, in order: Code with Pragmas; Transformation Descriptions; Search; Different versions; Compile; Execute and measure performance]
10X-Language
- Naming loops or scopes
    #pragma xlang name loop1
    for (i = 0; i < 10; i++) a[i] = 4;
- Format of a transformation
    #pragma xlang stripmine loop1 4 ii
  where "#pragma xlang" is the prefix, "stripmine" is the transformation name, "loop1" is the loop name, "4" is the parameter list, and "ii" is the name of the additional loops generated by the transformation
11Elementary transformations implemented in X-Language
- Full unrolling
- Partial unrolling
- Scalar promotion
- Interchange
- Loop fission
- Loop fusion
- Strip mining
- Lifting
- Software pipelining
12Applying transformation
    #pragma xlang loop1
    for (i = min; i < 4*max; i++)
      a[i] = b[i];
    #pragma xlang stripmine loop1 4 ii
13How to search for parameter values?
- Using multistage evaluation
- External script
    for (k = 1; k < 16; k = 2*k)
    {
      #pragma xlang loop1
      for (i = min; i < max; i++)
        a[i] = b[i];
      #pragma xlang stripmine loop1 d(k) ii
    }
14Composing transformations
    #pragma xlang loop1
    for (i = 0; i < 4; i++)
      #pragma xlang loop2
      for (j = min2; j < max2; j++)
        a[i] = b[j];
    #pragma xlang interchange loop1 loop2
    #pragma xlang fullunroll loop1
15Analyses and Transformations
- Static analyses should also enable the design of smarter (higher-level) transformation primitives
- External tool to find information
16Example with analysis
    for (i = 2; i < 2*N; i += 2) {
      u[i] = u[i-1] + u[i-2];
      u[i+1] = u[i] + u[i-1];
    }
17Extending the X-Language
Rewriting rule: pattern before -> pattern after transformation
Pattern before:
    #pragma xlang name iloop
    for (i = 0; i < N; i++) <body>
Pattern after transformation:
    #pragma xlang name iiloop1
    for (ii = 0; ii < (N/4)*4; ii += 4)
      #pragma xlang name iloop1
      for (i = ii; i < ii+4; i++) <body>
    #pragma xlang name iloop2
    for (i = (N/4)*4; i < N; i++) <body>
18Daxpy Example
    #pragma xlang name loop1
    for (k = 0; k < 2000; k++)
      Y[k] = alpha*X[k] + Y[k];
- We can modify values of N
    /* A few values tested for the unrolling factor: different generated versions */
    #pragma xlang transform stripmine loop1 k N
    #pragma xlang transform scalarize-in X in loop1
    #pragma xlang transform lift l1.loads before loop1
    #pragma xlang transform scalarize-out Y in loop1
    #pragma xlang transform lift loop1.loads before loop1
    #pragma xlang transform lift loop1.stores after loop1
    #pragma xlang transform fullunroll loop1.loads
    #pragma xlang transform fullunroll loop1.stores
    #pragma xlang transform fullunroll loop1
19Daxpy Example: Different generated versions
- Unrolling factor 2
    for (k = 0; k < 2000; k = k+2) {
      double x_0 = X[k+0];
      double x_1 = X[k+1];
      double y_0 = Y[k+0];
      double y_1 = Y[k+1];
      y_0 = alpha*x_0 + y_0;
      y_1 = alpha*x_1 + y_1;
      Y[k+0] = y_0;
      Y[k+1] = y_1;
    }
- Unrolling factor 4
    for (k = 0; k < 2000; k = k+4) {
      double x_0 = X[k+0]; double x_1 = X[k+1];
      double x_2 = X[k+2]; double x_3 = X[k+3];
      double y_0 = Y[k+0]; double y_1 = Y[k+1];
      double y_2 = Y[k+2]; double y_3 = Y[k+3];
      y_0 = alpha*x_0 + y_0; y_1 = alpha*x_1 + y_1;
      y_2 = alpha*x_2 + y_2; y_3 = alpha*x_3 + y_3;
      Y[k+0] = y_0; Y[k+1] = y_1;
      Y[k+2] = y_2; Y[k+3] = y_3;
    }
- Unrolling factor 8
    for (k = 0; k < 2000; k = k+16) {
      double x_0 = X[k+0]; double x_1 = X[k+1];
      double x_2 = X[k+2]; ...
      y_0 = alpha*x_0 + y_0; y_1 = alpha*x_1 + y_1;
      y_2 = alpha*x_2 + y_2; y_3 = alpha*x_3 + y_3; ...
      Y[k+0] = y_0; Y[k+1] = y_1;
      Y[k+2] = y_2; Y[k+3] = y_3; ...
    }
20Matrix Multiply(Loop Declaration)
- The DGEMM example
- Matrix Multiplication
- Problems
- Data locality
- Scheduling
    #pragma xlang name iloop
    for (i = 0; i < NB; i++)
      #pragma xlang name jloop
      for (j = 0; j < NB; j++)
        #pragma xlang name kloop
        for (k = 0; k < NB; k++) {
          c[i][j] = c[i][j] + a[i][k]*b[k][j];
        }
21Matrix Multiply(Transformation Declaration)
Sequence of transformations for Itanium
    #pragma xlang transform stripmine iloop NU NUloop
    #pragma xlang transform stripmine jloop MU MUloop
    #pragma xlang transform interchange kloop MUloop
    #pragma xlang transform interchange jloop NUloop
    #pragma xlang transform interchange kloop NUloop
    #pragma xlang transform fullunroll NUloop
    #pragma xlang transform fullunroll MUloop
    #pragma xlang transform scalarize_in b in kloop
    #pragma xlang transform scalarize_in a in kloop
    #pragma xlang transform scalarize_inout c in kloop
    #pragma xlang transform lift kloop.loads before kloop
    #pragma xlang transform lift kloop.stores after kloop
22Matrix Multiply(Transformation Sequence)
    #pragma xlang name iloop
    for (i = 0; i < NB; i++)
      #pragma xlang name jloop
      for (j = 0; j < NB; j += 4) {
        #pragma xlang name kloop.loads
        {
          c_0_0 = c[i+0][j+0];
          c_0_1 = c[i+0][j+1];
          c_0_2 = c[i+0][j+2];
          c_0_3 = c[i+0][j+3];
        }
        #pragma xlang name kloop
        for (k = 0; k < NB; k++) {
          a_0 = a[i+0][k];
          a_1 = a[i+0][k];
          a_2 = a[i+0][k];
          a_3 = a[i+0][k];
          b_0 = b[k][j+0];
          b_1 = b[k][j+1];
          b_2 = b[k][j+2];
          b_3 = b[k][j+3];
          c_0_0 = c_0_0 + a_0*b_0;
          c_0_1 = c_0_1 + a_1*b_1;
          c_0_2 = c_0_2 + a_2*b_2;
          c_0_3 = c_0_3 + a_3*b_3;
          ...
        }
        #pragma xlang name kloop.stores
        {
          c[i+0][j+0] = c_0_0;
          c[i+0][j+1] = c_0_1;
          c[i+0][j+2] = c_0_2;
          c[i+0][j+3] = c_0_3;
        }
      }
    ... // Remainder code
23Block copies
- Block matrix multiplication performs better if the matrices are contiguous in memory (TLB)
- Poor performance of the C copy
- Resort to a tool generating specific asm code
- Tool generating good code through search (XLG is an asm search)
24Matrix Multiply(Results)
25Conclusion
- Describe transformations with reuse, procedures, conditionals
- X-Language
- Language designed to generate multiversion programs
- Multistage language with a flexible pattern-matching and rewriting language
- Experts can describe application-specific transformations and optimizations
26Future work
- Dependence analysis
- Going further: searching asm code transformations
- More transformations: vectorization, alignment, ...