Title: Application Optimization: Composition of Program Transformations, Model and Tools (Optimisation d'applications : composition de transformations de programme, modèle et outils)
1 Application Optimization: Composition of Program Transformations, Model and Tools
- by Sylvain Girbal
- co-advisor: Albert Cohen
- co-advisor: Olivier Temam
- CEA supervisor: Jacques Raguideau
2 Context
- Increasing gap between peak performance and measured running time
- Inefficient resource usage
- Architecture
- Growing speculation complexity
- Memory hierarchy
- Branch predictors
- Dynamic scheduling
- Speculative loads
- Value prediction
- Growing resources
- Number of functional units
- New units
[Figure: performance over the years]
- Compilation
- Static cost models
- Inefficient for speculative and dynamic mechanisms
- Loop nest transformations
- Each defined in its own formalism
- Only a few are implemented in compilers
3 Existing Optimization Frameworks
- Optimization Tools
- Mainly for parallelization of sequential codes
- Drawbacks
- Black box behavior
- Restricted Applicability
- Syntax based optimizers
- Polaris
- SUIF
- ParaScope
- Sage
- Polyhedron based optimizers
- Petit (Omega)
- MMAlpha
- PICO
- PIPS
- Syntactic Representations
- Growing complexity
- Pattern matching rules
- Phase behavior of compilers
- Polyhedral Representation
- Restricted applicability
- Applicable to kernels only
- Only implement a few transformations
4 Iterative Compilation [O'Boyle]
- Embedded processors
- Increased Compilation Time
- Better optimizations
- Dedicated architectures
- Application with long lifespan
- General-Purpose processors
- Searching best transformation parameters
- Cope well with architectural changes
- Efficient with sampling techniques
- Only a few transformations are considered
- Decisions mainly based on execution time
- Restricted search space
[Figure: iterative compilation loop with Source, Compiler, Intermediate Binary, feedback, and Final Binary]
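The parameter search described above can be sketched as a loop that measures each candidate and keeps the cheapest one. This is only a minimal illustration: `search_best_param`, `cost_fn` and `toy_cost` are hypothetical names, and `toy_cost` is a stand-in for an actual compile-and-run measurement.

```c
#include <stddef.h>

/* Hypothetical cost callback: returns a measured cost (for example a run
 * time) for one candidate value of a transformation parameter. */
typedef double (*cost_fn)(int param);

/* Try every candidate and keep the one with the lowest measured cost.
 * Returns the index of the best candidate, or -1 if the list is empty. */
int search_best_param(const int *candidates, size_t n, cost_fn measure)
{
    int best = -1;
    double best_cost = 0.0;
    for (size_t k = 0; k < n; k++) {
        double c = measure(candidates[k]);
        if (best < 0 || c < best_cost) {
            best = (int)k;
            best_cost = c;
        }
    }
    return best;
}

/* Stand-in cost model, for illustration only: pretends that a tile size of
 * 32 is the sweet spot. A real driver would compile and time the code. */
double toy_cost(int tile)
{
    double d = (double)(tile - 32);
    return d * d;
}
```

The search is exhaustive over the candidates it is given, which is why the slide stresses sampling techniques and a restricted search space.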
5 Manual Optimizations
- Long sequences of transformations
- Transformations often hit the same loop nest
- Some enabling phases degrade performance
- Mainly regular loop nests
- High number of loop nests
- Low number of conditionals
- Performance loop nests
- High variance in instruction number
- Small loop depth
apsi (16)
swim (10)
applu (9)
galgel (23)
6 Outline
- Code optimization context
- Unified Formalism
- Example of code optimization
- Compilation Process
- Conclusion
7 Introducing Polyhedra
- Dependence and dataflow analysis
- Fine-grain (often exact) analysis enhances transformation opportunities
- Based on Z-polyhedra
- Polyhedron: system of affine inequalities
- i ≥ 1, i ≤ N, j ≥ 1, j ≤ N, i + j ≤ M
- Matrix representation
[Figure: matrix form of the system of inequalities]
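As a minimal sketch of the matrix representation, the block below stores one affine inequality per row and tests whether an integer point satisfies all of them. The column layout (i, j, parameter N, constant), the triangular domain, and the names `in_polyhedron`, `TRIANGLE` and `count_points` are assumptions chosen for illustration, not taken from the slides.

```c
#define NCOLS 4  /* columns: i, j, parameter N, constant */

/* One row r encodes the affine inequality
 * r[0]*i + r[1]*j + r[2]*N + r[3] >= 0. */
int in_polyhedron(const int rows[][NCOLS], int nrows, int i, int j, int N)
{
    for (int r = 0; r < nrows; r++) {
        int v = rows[r][0] * i + rows[r][1] * j + rows[r][2] * N + rows[r][3];
        if (v < 0)
            return 0;   /* one violated constraint: the point is outside */
    }
    return 1;
}

/* Illustrative triangular domain: 0 <= i < N, 0 <= j <= i. */
static const int TRIANGLE[4][NCOLS] = {
    { 1,  0, 0,  0},   /*  i         >= 0 */
    {-1,  0, 1, -1},   /* -i + N - 1 >= 0 */
    { 0,  1, 0,  0},   /*  j         >= 0 */
    { 1, -1, 0,  0},   /*  i - j     >= 0 */
};

/* Count the integer points of the domain by scanning a bounding box. */
int count_points(int N)
{
    int count = 0;
    for (int i = -1; i <= N; i++)
        for (int j = -1; j <= N; j++)
            count += in_polyhedron(TRIANGLE, 4, i, j, N);
    return count;
}
```

Because N is a column of the matrix rather than a constant, the same four rows describe the domain for every problem size.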
8 Defining Execution Order [Feautrier]
- Statements are executed more than once
- Scheduling orders statement instances
- Instance: <Statement, Iteration vector>
- Examples: <S1, 3>, <S3, (8,9)>
- Associate a timestamp to each instance
- Scheduling function θ
Running example:
    for (int i = 0; i < M; i++) { S1; S2; }
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            S3;
- Mono-dimensional scheduling function
- θ(S1,i) = 2i, θ(S2,i) = 2i+1, θ(S3,i,j) = 2M + iN + j
- Interchange: θ(S3,i,j) = 2M + jM + i
- Fusion: θ(S1,i) = (N+2)i, θ(S2,i) = (N+2)i + 1, θ(S3,i,j) = (N+2)i + 2 + j
- Multi-dimensional scheduling function
- Dimension d = depth(S)
- θ(S1,i) = (2i), θ(S2,i) = (2i+1), θ(S3,i,j) = (2M+i, j)
- Interchange: θ(S3,i,j) = (2M+j, i)
- Fusion: θ(S1,i) = (3i), θ(S2,i) = (3i+1), θ(S3,i,j) = (3i+2, j)
- Dimension 2d+1
- θ(S1,i) = (0,i,0), θ(S2,i) = (0,i,1), θ(S3,i,j) = (1,i,0,j,0)
- Interchange: θ(S3,i,j) = (1,j,0,i,0)
- Fusion: θ(S1,i) = (0,i,0), θ(S2,i) = (0,i,1), θ(S3,i,j) = (0,i,2,j,0)
- All instances of statement S2 occur before any instance of statement S3
- The instance <S1, i=7> occurs after instance <S2, i=6>
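The two ordering claims on this slide can be checked mechanically by comparing 2d+1 timestamps lexicographically. The sketch below hard-codes the schedules of the running example; the type `timestamp`, the function names, and the zero-padding of shorter vectors are assumptions made for illustration.

```c
#define TS_MAX 5

/* A timestamp is an integer vector, compared lexicographically. */
typedef struct { int len; int v[TS_MAX]; } timestamp;

/* 2d+1 schedules of the running example. */
timestamp theta_S1(int i)        { timestamp t = {3, {0, i, 0}};       return t; }
timestamp theta_S2(int i)        { timestamp t = {3, {0, i, 1}};       return t; }
timestamp theta_S3(int i, int j) { timestamp t = {5, {1, i, 0, j, 0}}; return t; }

/* Returns <0, 0 or >0 when a runs before, together with, or after b.
 * Assumption: a shorter vector is padded with zeros. */
int ts_cmp(timestamp a, timestamp b)
{
    for (int k = 0; k < TS_MAX; k++) {
        int x = k < a.len ? a.v[k] : 0;
        int y = k < b.len ? b.v[k] : 0;
        if (x != y)
            return x < y ? -1 : 1;
    }
    return 0;
}
```

With these definitions, `ts_cmp(theta_S2(i), theta_S3(i2, j))` is negative for any iteration values because the first component already differs (0 against 1), and `ts_cmp(theta_S1(7), theta_S2(6))` is positive because the second component decides (7 against 6).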
9 Unified Formalism
- Unified formalism for code transformations
- Storing statement-wise information
- Based on matrix representations
- Code transformations = composition of matrix operations
- Eases the composition of transformations
    for (int i = 0; i < M; i++) {
        Z[i] = 0;
        for (int j = 0; j < N; j++)
            Z[i] += A[i][j] * Y[j];
    }
[Figure: per-statement matrices for the domain, the scheduling, and each array access]
10 Statement Control
    for (int i = 0; i < M; i++) {
        S1: Z[i] = 0;
        for (int j = 0; j < N; j++)
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
    }
    for (int k = 0; k < P; k++)
        for (int l = 0; l < Q; l++)
            S3: Z[k] += A[k][l] * Y[l];
- Domain of S1: 0 ≤ i < M
- As inequalities: i ≥ 0, M-i-1 ≥ 0
- Domain of S2: 0 ≤ i < M, 0 ≤ j < N
- As inequalities: i ≥ 0, M-i-1 ≥ 0, j ≥ 0, N-j-1 ≥ 0
- Domain of S3: 0 ≤ k < P, 0 ≤ l < Q
- As inequalities: k ≥ 0, P-k-1 ≥ 0, l ≥ 0, Q-l-1 ≥ 0
11 Statement Scheduling
    for (int i = 0; i < M; i++) {
        S1: Z[i] = 0;
        for (int j = 0; j < N; j++)
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
    }
    for (int i = 0; i < P; i++)
        for (int j = 0; j < Q; j++)
            S3: Z[i] += A[i][j] * Y[j];
[Figure: scheduling matrices of S1, S2 and S3]
12 Scheduling Separation
[Figure: the scheduling matrix of S3 split into three components]
- Scheduling vector
- Fusion
- Fission
- Code motion
- Parameter scheduling matrix
- Iteration scheduling matrix
- Interchange
- Skewing
- Reversal
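A minimal sketch of this separation, for a statement nested in two loops: the scheduling vector and the iteration scheduling matrix are kept as separate fields, and each transformation touches only one of them. The struct `schedule2` and all function names are hypothetical, and the parameter scheduling matrix is left out to keep the example short.

```c
/* Schedule of a depth-2 statement, split into independent components.
 * The parameter scheduling matrix is omitted in this sketch. */
typedef struct {
    int beta[3];   /* scheduling vector: textual order at each depth */
    int A[2][2];   /* iteration scheduling matrix */
} schedule2;

/* Interleave beta with A*(i,j) into a 2d+1 timestamp:
 * out = (beta0, A0.(i,j), beta1, A1.(i,j), beta2). */
void timestamp2(const schedule2 *s, int i, int j, int out[5])
{
    out[0] = s->beta[0];
    out[1] = s->A[0][0] * i + s->A[0][1] * j;
    out[2] = s->beta[1];
    out[3] = s->A[1][0] * i + s->A[1][1] * j;
    out[4] = s->beta[2];
}

/* Interchange only touches A: swap its two rows. */
void interchange2(schedule2 *s)
{
    for (int c = 0; c < 2; c++) {
        int tmp = s->A[0][c];
        s->A[0][c] = s->A[1][c];
        s->A[1][c] = tmp;
    }
}

/* Reversal of the loop at `depth` only touches A: negate one row. */
void reverse2(schedule2 *s, int depth)
{
    s->A[depth][0] = -s->A[depth][0];
    s->A[depth][1] = -s->A[depth][1];
}

/* Fusion, fission and code motion only touch beta. */
void set_beta2(schedule2 *s, int b0, int b1, int b2)
{
    s->beta[0] = b0;
    s->beta[1] = b1;
    s->beta[2] = b2;
}
```

Starting from beta = (1,0,0) and the identity matrix, `timestamp2` yields (1,i,0,j,0); after `interchange2` it yields (1,j,0,i,0), and after `set_beta2(s, 0, 2, 0)` the first and third components become 0 and 2, all without either component knowing about the other.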
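14 Code Optimization Example
Transformation sequence: external loop fusion, internal loop fusion, fission of Z initialization, internal strip-mine, external strip-mine, center loop interchange.
Original code:
    for (i = 0; i < M; i++) {
        S1: Z[i] = 0;
        for (j = 0; j < M; j++)
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
    }
    for (i = 0; i < M; i++)
        for (j = 0; j < M; j++)
            S3: Z[i] += A[i][j] * Y[j];
Fusion (external loops):
    for (i = 0; i < M; i++) {
        S1: Z[i] = 0;
        for (j = 0; j < M; j++)
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
        for (j = 0; j < M; j++)
            S3: Z[i] += A[i][j] * Y[j];
    }
Fusion (internal loops):
    for (i = 0; i < M; i++) {
        S1: Z[i] = 0;
        for (j = 0; j < M; j++) {
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
            S3: Z[i] += A[i][j] * Y[j];
        }
    }
Fission (Z initialization):
    for (i = 0; i < M; i++)
        S1: Z[i] = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < M; j++) {
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
            S3: Z[i] += A[i][j] * Y[j];
        }
Tiling (both strip-mines, then the interchange):
    for (i = 0; i < M; i++)
        S1: Z[i] = 0;
    for (ii = 0; ii < M/32; ii++)
        for (jj = 0; jj < M/32; jj++)
            for (i = 32*ii; i < min(M, ii*32+32); i++)
                for (j = 32*jj; j < min(M, jj*32+32); j++) {
                    S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
                    S3: Z[i] += A[i][j] * Y[j];
                }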
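15 Code Optimization Example
[Figure: matrices of S1, S2 and S3 after each transformation]
Original code:
    for (i = 0; i < M; i++) {
        S1: Z[i] = 0;
        for (j = 0; j < M; j++)
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
    }
    for (i = 0; i < M; i++)
        for (j = 0; j < M; j++)
            S3: Z[i] += A[i][j] * Y[j];
After external loop fusion:
    for (i = 0; i < M; i++) {
        S1: Z[i] = 0;
        for (j = 0; j < M; j++)
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
        for (j = 0; j < M; j++)
            S3: Z[i] += A[i][j] * Y[j];
    }
After internal loop fusion:
    for (i = 0; i < M; i++) {
        S1: Z[i] = 0;
        for (j = 0; j < M; j++) {
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
            S3: Z[i] += A[i][j] * Y[j];
        }
    }
After fission of Z initialization:
    for (i = 0; i < M; i++)
        S1: Z[i] = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < M; j++) {
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
            S3: Z[i] += A[i][j] * Y[j];
        }
After internal strip-mine:
    for (i = 0; i < M; i++)
        S1: Z[i] = 0;
    for (i = 0; i < M; i++)
        for (jj = 0; jj < M/32; jj++)
            for (j = 32*jj; j < min(M, jj*32+32); j++) {
                S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
                S3: Z[i] += A[i][j] * Y[j];
            }
After external strip-mine:
    for (i = 0; i < M; i++)
        S1: Z[i] = 0;
    for (ii = 0; ii < M/32; ii++)
        for (i = 32*ii; i < min(M, ii*32+32); i++)
            for (jj = 0; jj < M/32; jj++)
                for (j = 32*jj; j < min(M, jj*32+32); j++) {
                    S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
                    S3: Z[i] += A[i][j] * Y[j];
                }
After center loop interchange:
    for (i = 0; i < M; i++)
        S1: Z[i] = 0;
    for (ii = 0; ii < M/32; ii++)
        for (jj = 0; jj < M/32; jj++)
            for (i = 32*ii; i < min(M, ii*32+32); i++)
                for (j = 32*jj; j < min(M, jj*32+32); j++) {
                    S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
                    S3: Z[i] += A[i][j] * Y[j];
                }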
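To see that the transformation sequence preserves the result, the first and last versions can be run side by side. The sketch below is one possible C rendering, under several assumptions: arrays are flattened row-major, the tile count uses an integer ceiling `(m + 31) / 32` so that a partial last tile is covered, the accumulation into `Z[i]` is treated as a reduction, and the test data are small integers so that reordering the sums cannot change the floating-point result. All function names are hypothetical.

```c
#define TILE 32
#define MAXM 64

static int min_int(int a, int b) { return a < b ? a : b; }

/* Original code: two separate loop nests. A and B are m*m, row-major. */
void kernel_original(int m, const double *A, const double *B,
                     const double *X, const double *Y, double *Z)
{
    for (int i = 0; i < m; i++) {
        Z[i] = 0;                                          /* S1 */
        for (int j = 0; j < m; j++)
            Z[i] += (A[i*m + j] + B[j*m + i]) * X[j];      /* S2 */
    }
    for (int i = 0; i < m; i++)
        for (int j = 0; j < m; j++)
            Z[i] += A[i*m + j] * Y[j];                     /* S3 */
}

/* After fusion, fission of the initialization, and tiling. */
void kernel_tiled(int m, const double *A, const double *B,
                  const double *X, const double *Y, double *Z)
{
    for (int i = 0; i < m; i++)
        Z[i] = 0;                                          /* S1 */
    int nt = (m + TILE - 1) / TILE;   /* ceiling: covers a partial last tile */
    for (int ii = 0; ii < nt; ii++)
        for (int jj = 0; jj < nt; jj++)
            for (int i = TILE*ii; i < min_int(m, TILE*ii + TILE); i++)
                for (int j = TILE*jj; j < min_int(m, TILE*jj + TILE); j++) {
                    Z[i] += (A[i*m + j] + B[j*m + i]) * X[j];   /* S2 */
                    Z[i] += A[i*m + j] * Y[j];                  /* S3 */
                }
}

/* Fill the inputs with small integers (exact in double) and compare the
 * two versions. Returns 1 when every element of Z matches. */
int kernels_agree(int m)
{
    static double A[MAXM*MAXM], B[MAXM*MAXM];
    static double X[MAXM], Y[MAXM], Z1[MAXM], Z2[MAXM];
    if (m < 1 || m > MAXM)
        return 0;
    for (int i = 0; i < m; i++) {
        X[i] = i % 3;
        Y[i] = i % 4;
        for (int j = 0; j < m; j++) {
            A[i*m + j] = (i + 2*j) % 5;
            B[i*m + j] = (3*i + j) % 7;
        }
    }
    kernel_original(m, A, B, X, Y, Z1);
    kernel_tiled(m, A, B, X, Y, Z2);
    for (int i = 0; i < m; i++)
        if (Z1[i] != Z2[i])
            return 0;
    return 1;
}
```

With arbitrary floating-point data the two versions may differ in the last bits, since the tiled code adds the S2 and S3 contributions in a different order.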
16 Composition Issue: Strip-Mining Case
- Details on strip-mining
- Strip-mine
- Shifting + strip-mine
    for (i = 0; i < M; i++) { S1(i); S2(i); }
- Classical strip-mining
- Parallel to domain iterators
- Time strip-mining
- Parallel to traversal order
- Usually preferred
17 Composition: Confluence
- Commutation
- Transformations targeting different components commute
- Scheduling transformation
- Transformation changing sequentiality (β-transformation)
- Transformation changing iteration ordering (A-transformation, Γ-transformation)
- Domain transformation (Λ-transformation)
- Access transformation
- ?-transformations commute with other ?-transformations
- Dimension transformations do not commute
- Confluence
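Unroll and Jam, reached along two paths from:
    for (i = 0; i < M; i++)
        for (j = 0; j < M; j++)
            S(i,j);
Path 1, external partial unroll:
    for (i = 0; i < M; i += 2) {
        for (j = 0; j < M; j++)
            S(i,j);
        for (j = 0; j < M; j++)
            S(i+1,j);
    }
then internal fusions:
    for (i = 0; i < M; i += 2)
        for (j = 0; j < M; j++) {
            S(i,j);
            S(i+1,j);
        }
Path 2, external strip-mine:
    for (i = 0; i < M; i += 2)
        for (ii = i; ii <= i+1; ii++)
            for (j = 0; j < M; j++)
                S(ii,j);
then internal interchange:
    for (i = 0; i < M; i += 2)
        for (j = 0; j < M; j++)
            for (ii = i; ii <= i+1; ii++)
                S(ii,j);
then internal full unroll, which yields the same code as path 1.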
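The confluence of the two paths can be checked by recording the order in which S(i,j) is called. The sketch below assumes M is even and small enough for the fixed-size trace; `trace`, `S`, and the three functions are hypothetical helpers written for this illustration.

```c
#define MAX_TRACE 1024

/* Records the sequence of (i, j) pairs passed to S. */
typedef struct { int n; int i[MAX_TRACE]; int j[MAX_TRACE]; } trace;

static void S(trace *t, int i, int j)
{
    if (t->n < MAX_TRACE) {
        t->i[t->n] = i;
        t->j[t->n] = j;
    }
    t->n++;
}

/* Path 1: external partial unroll by 2, then fusion of the inner loops. */
void unroll_then_fuse(trace *t, int M)
{
    t->n = 0;
    for (int i = 0; i < M; i += 2)
        for (int j = 0; j < M; j++) {
            S(t, i, j);
            S(t, i + 1, j);
        }
}

/* Path 2: external strip-mine by 2, then interchange of the inner loops.
 * Fully unrolling the ii loop would give the text of path 1. */
void stripmine_then_interchange(trace *t, int M)
{
    t->n = 0;
    for (int i = 0; i < M; i += 2)
        for (int j = 0; j < M; j++)
            for (int ii = i; ii <= i + 1; ii++)
                S(t, ii, j);
}

/* Returns 1 when both traces hold the same calls in the same order. */
int same_trace(const trace *a, const trace *b)
{
    if (a->n != b->n || a->n > MAX_TRACE)
        return 0;
    for (int k = 0; k < a->n; k++)
        if (a->i[k] != b->i[k] || a->j[k] != b->j[k])
            return 0;
    return 1;
}
```

For an odd M both loops would call S with i+1 equal to M, so a real implementation needs a remainder loop; that case is outside this sketch.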
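19 Compilation Process
[Figure: compilation flow from input.src to output.bin. Compiler phases exchanging IR: PreOPT, LNO, WOPT, CG. Polyhedral tools exchanging WRaP: WRaP-IT (extraction), URUK (transformations), URGenT (code generation). Names credited on the figure: C. Bastoul, N. Vasilache, S. Sharma]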
20 Defining Transformations
- With a script language, to easily add new transformations
- As compositions of previously defined transformations
- Using C++ to keep the syntax close to the formalism
Script:
    transformation move
    param BetaPrefix P
    param BetaPrefix Q
    param Integer o
    code
        d = P.dim()
        foreach WrapStatement S in SCoP
            if ((P <= S.Beta) && (Q <= S.Beta)) S.Beta(d) += o
            if ((P <= S.Beta) && (Q << S.Beta)) S.Beta(d) += o
Formalism:
    move(P, Q, o):
        d ← dim(P)
        ∀S ∈ SCoP:
            P ⊑ β^S ∧ Q ⊑ β^S ⇒ β^S_d ← β^S_d + o
            P ⊑ β^S ∧ Q ≪ β^S ⇒ β^S_d ← β^S_d + o
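A possible plain-C reading of the move transformation is sketched below. It is not the URUK implementation: it assumes that `P <= S.Beta` means "P is a prefix of the statement's Beta vector", that `Q << S.Beta` means "Q is strictly before Beta in lexicographic order", and that `S.Beta(d)` is the 0-based component right after the prefix P. The type `beta_vec` and the function names are hypothetical.

```c
#define BETA_MAX 8

typedef struct { int len; int v[BETA_MAX]; } beta_vec;

/* Assumed meaning of "P <= S.Beta": p is a prefix of b. */
int is_prefix(const beta_vec *p, const beta_vec *b)
{
    if (p->len > b->len)
        return 0;
    for (int k = 0; k < p->len; k++)
        if (p->v[k] != b->v[k])
            return 0;
    return 1;
}

/* Assumed meaning of "Q << S.Beta": q is strictly before b in lexicographic
 * order, decided at the first position where they differ. */
int is_before(const beta_vec *q, const beta_vec *b)
{
    int n = q->len < b->len ? q->len : b->len;
    for (int k = 0; k < n; k++)
        if (q->v[k] != b->v[k])
            return q->v[k] < b->v[k];
    return 0;   /* one is a prefix of the other: not strictly before */
}

/* move(P, Q, o): at depth dim(P), shift by o every statement that lies
 * under P and at or after Q. Returns the number of statements moved. */
int move_beta(beta_vec *stmts, int nstmts,
              const beta_vec *P, const beta_vec *Q, int o)
{
    int d = P->len;   /* 0-based index of the component after prefix P */
    int moved = 0;
    for (int s = 0; s < nstmts; s++) {
        beta_vec *b = &stmts[s];
        if (d < b->len && is_prefix(P, b) &&
            (is_prefix(Q, b) || is_before(Q, b))) {
            b->v[d] += o;
            moved++;
        }
    }
    return moved;
}
```

For example, with Beta vectors (0,0), (0,1) and (1,0,0) for S1, S2 and S3, calling `move_beta` with an empty P, Q = (1) and o = 1 only shifts S3, whose vector becomes (2,0,0); this opens a free slot at position 1 of the outermost level.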
21 Applying Transformations
- Framework inputs
- Source code
- Decorated with labels
- Using compiler internal representation
- Script
- Describing transformation to apply
- Using source labels
- Framework output
- Transformed IR
Source code decorated with labels (input IR):
    for (i = 0; i < M; i++) {
        __URUK_LBL1 S1: Z[i] = 0;
        for (j = 0; j < M; j++)
            __URUK_LBL2 S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
    }
    for (i = 0; i < M; i++)
        for (j = 0; j < M; j++)
            __URUK_LBL3 S3: Z[i] += A[i][j] * Y[j];
Script, one line per transformation:
    fusion(enclose(LBL1))             external loop fusion
    fusion(enclose(LBL2))             internal loop fusion
    split(enclose(LBL2))              fission of Z initialization
    stripmine(enclose(LBL3),32)       internal strip-mine
    stripmine(enclose(LBL3,2),32)     external strip-mine
    interchange(enclose(LBL3,3))      center loop interchange
Transformed IR:
    for (i = 0; i < M; i++)
        S1: Z[i] = 0;
    for (ii = 0; ii < M/32; ii++)
        for (jj = 0; jj < M/32; jj++)
            for (i = 32*ii; i < min(M, ii*32+32); i++)
                for (j = 32*jj; j < min(M, jj*32+32); j++) {
                    S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
                    S3: Z[i] += A[i][j] * Y[j];
                }
22 Tools Using the Formalism
- PolyDeps: dependence checker
- URGenT: code generator
- Based on CLooG
- Taking advantage of formalism invariants
- Exponential reduction in the number of polyhedral computations
- Reduced memory footprint
[Figure: Traditional flow: an analysis phase, then checking for applicability before each transformation phase. PolyDeps flow: transformation phases saving info on dependency, then dependency checking, looking for broken dependencies]
23 Conclusion & Future Works
- Contributions
- Program abstraction class: SCoPs
- Good coverage for non-pointer-intensive codes
- Formalism for both programs and program transformations
- Component separation
- Eases composition of program transformations
- Implementation of the compilation framework
- Definitions close to the formalism, composition oriented
- Usable by non-expert users
- swim (SPECfp 2000 benchmark)
- More than 30% speedup compared to the best compilers
24 Conclusion & Future Works
- Ongoing & future works
- Optimize more SPEC benchmarks → automatically
- Searching for transformation opportunities
- Opportunities as transformations (code motion)
- Engineering
- WRaP-IT enhancements (modulo) → better SCoP coverage
- Array generation → how to provide array memory mapping information
- A language for URUK scripts → polyhedral meta-programming
- Extending the notion of labels
- Missing a way to target a range of source code
- Toward instruction instances
- Integration in an iterative compilation framework