Application Optimization: Composition of Program Transformations, Model and Tools

Provided by: lri

Transcript and Presenter's Notes



1
Application Optimization: Composition of
Program Transformations, Model and Tools
  • by Sylvain Girbal
  • Co-advisor: Albert Cohen
  • Co-advisor: Olivier Temam
  • Supervisor at CEA: Jacques Raguideau

2
Context
  • Increasing gap between peak performance and
    measured running time
  • Inefficient resource usage
  • Architecture
  • Increasing speculation complexity
  • Memory hierarchy
  • Branch predictors
  • Dynamic scheduling
  • Speculative Loads
  • Value prediction
  • Increasing resources
  • Functional unit number
  • New units

[Figure: performance over the years]
  • Compilation
  • Static cost models
  • Inefficient for speculative and dynamic
    mechanisms.
  • Loop nest transformations
  • Each defined in its own formalism
  • A few are implemented in compilers

3
Existing Optimization Frameworks
  • Optimization Tools
  • Mainly for parallelization of sequential codes
  • Drawbacks
  • Black box behavior
  • Restricted Applicability
  • Syntax-based optimizers
  • Polaris
  • SUIF
  • ParaScope
  • Sage
  • Polyhedron-based optimizers
  • Petit (Omega)
  • MMAlpha
  • PICO
  • PIPS
  • Syntactic Representations
  • Growing complexity
  • Pattern matching rules
  • Phase behavior of compilers
  • Polyhedral Representation
  • Restricted applicability
  • Applicable to kernels only
  • Only implement a few transformations

4
Iterative Compilation
O'Boyle
  • Embedded processors
  • Increased Compilation Time
  • Better optimizations
  • Dedicated architectures
  • Applications with a long lifespan
  • General-purpose processors
  • Searching for the best transformation parameters
  • Copes well with architectural changes
  • Efficient with sampling techniques
  • Only a few transformations are considered
  • Decisions mainly based on execution time
  • Restricted search space

[Figure: Source → Compiler → Intermediate Binary, with execution feedback to the compiler, until the Final Binary is produced]
5
Manual Optimizations
  • Long sequences of transformations
  • Transformations often hit the same loop nest
  • Some enabling phases degrade performance
  • Mainly regular loop nests
  • High number of loop nests
  • Low number of conditionals
  • Performance loop nests
  • High variance in instruction number
  • Small loop depth

[Figure: manual optimization of apsi (16), swim (10), applu (9) and galgel (23)]
6
Outline
  • Code optimization context
  • Unified Formalism
  • Example of code optimization
  • Compilation Process
  • Conclusion

7
Introducing Polyhedra
  • Dependency and dataflow analysis
  • Fine-grain (often exact) analysis enhances
    transformation opportunities
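  • Based on ℤ-polyhedra
  • Polyhedron: a system of affine inequalities, e.g. $1 \le i \le N,\ 1 \le j \le N,\ i + j \le M$
  • Matrix representation:
$$\begin{pmatrix} 1 & 0 & 0 & 0 & -1 \\ -1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 & -1 \\ 0 & -1 & 1 & 0 & 0 \\ -1 & -1 & 0 & 1 & 0 \end{pmatrix} \begin{pmatrix} i \\ j \\ N \\ M \\ 1 \end{pmatrix} \ge \vec{0}$$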
8
Defining Execution Order
Feautrier
  • Statements are executed more than once
  • Scheduling orders statement instances
  • ⟨Statement, Iteration vector⟩
  • ⟨S1, 3⟩, ⟨S3, (8, 9)⟩
  • Associate a timestamp with each instance
  • Scheduling function θ
  • Mono-dimensional scheduling function
  • $\theta(S1,i) = 2i,\quad \theta(S2,i) = 2i+1,\quad \theta(S3,i,j) = 2M + iN + j$
  • Multi-dimensional scheduling function
  • Dimension $d = \mathrm{depth}(S)$
  • $\theta(S1,i) = (2i),\quad \theta(S2,i) = (2i+1),\quad \theta(S3,i,j) = (2M+i,\ j)$
  • Dimension $2d+1$
  • $\theta(S1,i) = (0,i,0),\quad \theta(S2,i) = (0,i,1),\quad \theta(S3,i,j) = (1,i,0,j,0)$

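Example program:
    for (int i = 0; i < M; i++) { S1; S2; }
    for (int i = 0; i < M; i++)
      for (int j = 0; j < N; j++)
        S3;
All instances of statement S2 occur before any instance of statement S3. The instance ⟨S1, i=7⟩ occurs after the instance ⟨S2, i=6⟩.
Interchange (of the S3 nest) and fusion (of the two outer loops), expressed in each form:
  • Mono-dimensional. Interchange: $\theta(S3,i,j) = 2M + jM + i$. Fusion: $\theta(S1,i) = (N+2)i$, $\theta(S2,i) = (N+2)i + 1$, $\theta(S3,i,j) = (N+2)i + 2 + j$
  • Multi-dimensional, dimension d. Interchange: $\theta(S3,i,j) = (2M+j,\ i)$. Fusion: $\theta(S1,i) = (3i)$, $\theta(S2,i) = (3i+1)$, $\theta(S3,i,j) = (3i+2,\ j)$
  • Dimension 2d+1. Interchange: $\theta(S3,i,j) = (1,j,0,i,0)$. Fusion: $\theta(S1,i) = (0,i,0)$, $\theta(S2,i) = (0,i,1)$, $\theta(S3,i,j) = (0,i,2,j,0)$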
9
Unified Formalism
  • Unified formalism for code transformations
  • Storing statement-wise information
  • Based on matrix representations
  • Code transformations as compositions of matrix
    operations
  • Eases the composition of transformations

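    for (int i = 0; i < M; i++) {
      Z[i] = 0;
      for (int j = 0; j < N; j++)
        Z[i] += A[i][j] * Y[j];
    }
[Figure: the statement is annotated with its Domain, its Scheduling, and one Access function per array reference]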
10
Statement Control
    for (int i = 0; i < M; i++) {
      S1: Z[i] = 0;
      for (int j = 0; j < N; j++)
        S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
    }
    for (int k = 0; k < P; k++)
      for (int l = 0; l < Q; l++)
        S3: Z[k] += A[k][l] * Y[l];
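  • Domain of S1: $0 \le i < M$
  • As affine inequalities: $i \ge 0,\ M - i - 1 \ge 0$
  • Domain of S2: $0 \le i < M,\ 0 \le j < N$
  • As affine inequalities: $i \ge 0,\ M - i - 1 \ge 0,\ j \ge 0,\ N - j - 1 \ge 0$
  • Domain of S3: $0 \le k < P,\ 0 \le l < Q$
  • As affine inequalities: $k \ge 0,\ P - k - 1 \ge 0,\ l \ge 0,\ Q - l - 1 \ge 0$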

11
Statement Scheduling
    for (int i = 0; i < M; i++) {
      S1: Z[i] = 0;
      for (int j = 0; j < N; j++)
        S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
    }
    for (int i = 0; i < P; i++)
      for (int j = 0; j < Q; j++)
        S3: Z[i] += A[i][j] * Y[j];
    for (int i = 0; i < M; i++) {
      S1: Z[i] = 0;
      for (int j = 0; j < N; j++)
        S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
    }
    for (int i = 0; i < P; i++)
      for (int j = 0; j < Q; j++)
        S3: Z[j] += A[j][i] * Y[j];
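Scheduling vectors (position of each statement in the loop tree): $\beta^{S1} = (0, 0)$, $\beta^{S2} = (0, 1, 0)$, $\beta^{S3} = (1, 0, 0)$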
12
Scheduling Separation
[Figure: the schedule of S3 split into its three components]
  • Scheduling vector β: fusion, fission, code motion
  • Iteration scheduling matrix: interchange, skewing, reversal
  • Parameter scheduling matrix: shifting

13
Outline
  • Code optimization context
  • Unified Formalism
  • Example of code optimization
  • Compilation Process
  • Conclusion

14
Code Optimization Example
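Original code:
    for (i = 0; i < M; i++) {
      S1: Z[i] = 0;
      for (j = 0; j < M; j++)
        S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
    }
    for (i = 0; i < M; i++)
      for (j = 0; j < M; j++)
        S3: Z[i] += A[i][j] * Y[j];
Transformation sequence: external loop fusion, internal loop fusion, fission of Z initialization, internal strip-mine, external strip-mine, center loop interchange.
After external loop fusion:
    for (i = 0; i < M; i++) {
      S1: Z[i] = 0;
      for (j = 0; j < M; j++)
        S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
      for (j = 0; j < M; j++)
        S3: Z[i] += A[i][j] * Y[j];
    }
After internal loop fusion:
    for (i = 0; i < M; i++) {
      S1: Z[i] = 0;
      for (j = 0; j < M; j++) {
        S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
        S3: Z[i] += A[i][j] * Y[j];
      }
    }
After fission of Z initialization:
    for (i = 0; i < M; i++)
      S1: Z[i] = 0;
    for (i = 0; i < M; i++)
      for (j = 0; j < M; j++) {
        S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
        S3: Z[i] += A[i][j] * Y[j];
      }
After tiling (both strip-mines and the interchange):
    for (i = 0; i < M; i++)
      S1: Z[i] = 0;
    for (ii = 0; ii < (M + 31) / 32; ii++)
      for (jj = 0; jj < (M + 31) / 32; jj++)
        for (i = 32 * ii; i < min(M, 32 * ii + 32); i++)
          for (j = 32 * jj; j < min(M, 32 * jj + 32); j++) {
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
            S3: Z[i] += A[i][j] * Y[j];
          }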
15
Code Optimization Example
Same example, showing how the matrices of S1, S2 and S3 change after each step [matrix figure not recoverable]. External loop fusion, internal loop fusion and fission of Z initialization give the same code as on the previous slide; the remaining steps are:
After internal strip-mine (j loop):
    for (i = 0; i < M; i++)
      S1: Z[i] = 0;
    for (i = 0; i < M; i++)
      for (jj = 0; jj < (M + 31) / 32; jj++)
        for (j = 32 * jj; j < min(M, 32 * jj + 32); j++) {
          S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
          S3: Z[i] += A[i][j] * Y[j];
        }
After external strip-mine (i loop):
    for (i = 0; i < M; i++)
      S1: Z[i] = 0;
    for (ii = 0; ii < (M + 31) / 32; ii++)
      for (i = 32 * ii; i < min(M, 32 * ii + 32); i++)
        for (jj = 0; jj < (M + 31) / 32; jj++)
          for (j = 32 * jj; j < min(M, 32 * jj + 32); j++) {
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
            S3: Z[i] += A[i][j] * Y[j];
          }
After center loop interchange (i and jj):
    for (i = 0; i < M; i++)
      S1: Z[i] = 0;
    for (ii = 0; ii < (M + 31) / 32; ii++)
      for (jj = 0; jj < (M + 31) / 32; jj++)
        for (i = 32 * ii; i < min(M, 32 * ii + 32); i++)
          for (j = 32 * jj; j < min(M, 32 * jj + 32); j++) {
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
            S3: Z[i] += A[i][j] * Y[j];
          }
16
Composition Issue: Strip-Mining Case
  • Details on Strip-mining
  • Strip-mine
  • Shifting Strip-mine

    for (i = 0; i < M; i++) { S1(i); S2(i); }
  • Classical Strip-mining
  • Parallel to domain iterators
  • Time Strip-mining
  • Parallel to traversal order
  • Usually preferred

17
Composition Confluence
  • Commutation
  • Transformations targeting different components
    commute
  • Scheduling transformations
  • Transformations changing sequentiality
    (β-transformations)
  • Transformations changing iteration ordering
    (iteration and parameter scheduling matrices)
  • Domain transformations
  • Access transformations
  • ?-Transformations commute with other
    ?-Transformations
  • Dimension transformations do not commute
  • Confluence

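Example: two different sequences lead to unroll-and-jam (the unroll by 2 assumes M is even).
Original:
    for (i = 0; i < M; i++)
      for (j = 0; j < M; j++)
        S(i, j);
Path 1, external partial unroll:
    for (i = 0; i < M; i += 2) {
      for (j = 0; j < M; j++) S(i, j);
      for (j = 0; j < M; j++) S(i + 1, j);
    }
then internal fusion:
    for (i = 0; i < M; i += 2)
      for (j = 0; j < M; j++) { S(i, j); S(i + 1, j); }
Path 2, external strip-mine:
    for (i = 0; i < M; i += 2)
      for (ii = i; ii <= i + 1; ii++)
        for (j = 0; j < M; j++)
          S(ii, j);
then internal interchange:
    for (i = 0; i < M; i += 2)
      for (j = 0; j < M; j++)
        for (ii = i; ii <= i + 1; ii++)
          S(ii, j);
then internal full unroll, which yields the same unroll-and-jam code as path 1.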
18
Outline
  • Code optimization context
  • Unified Formalism
  • Example of code optimization
  • Compilation Process
  • Conclusion

19
Compilation Process
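[Figure: compilation flow in three stages (Extraction, Transformations, Code generation): input.src → PreOPT → WRaP-IT (extraction, IR → WRaP) → URUK (transformations, WRaP → WRaP) → URGenT (code generation, WRaP → IR) → LNO → WOPT → CG → output.bin. Contributors: C. Bastoul, N. Vasilache, S. Sharma.]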
20
Defining Transformations
  • With a script language, to easily add new
    transformations
  • As compositions of previously defined
    transformations
  • Using C++ to keep the syntax close to the formalism

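    transformation move
      param BetaPrefix P
      param BetaPrefix Q
      param Integer o
    code {
      d = P.dim();
      foreach WrapStatement S in SCoP {
        if ((P <= S.Beta) && (Q <= S.Beta))
          S.Beta(d) += o;
        if ((P <= S.Beta) && (Q << S.Beta))
          S.Beta(d) += o;
      }
    }
Formal definition:
$$\mathrm{move}(P, Q, o):\quad d = \dim(P),\ \forall S \in \mathrm{SCoP}:\quad \begin{cases} P \sqsubseteq \beta^S \wedge Q \sqsubseteq \beta^S \Rightarrow \beta^S_d \leftarrow \beta^S_d + o \\ P \sqsubseteq \beta^S \wedge Q \ll \beta^S \Rightarrow \beta^S_d \leftarrow \beta^S_d + o \end{cases}$$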
21
Applying Transformations
Labeled source code:
    for (i = 0; i < M; i++) {
      __URUK_LBL1: S1: Z[i] = 0;
      for (j = 0; j < M; j++)
        __URUK_LBL2: S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
    }
    for (i = 0; i < M; i++)
      for (j = 0; j < M; j++)
        __URUK_LBL3: S3: Z[i] += A[i][j] * Y[j];
Script for external loop fusion, internal loop fusion, fission of Z initialization, internal strip-mine, external strip-mine and center loop interchange, applied to the IR:
    fusion(enclose(LBL1));
    fusion(enclose(LBL2));
    split(enclose(LBL2));
    stripmine(enclose(LBL3), 32);
    stripmine(enclose(LBL3, 2), 32);
    interchange(enclose(LBL3, 3));
  • Framework inputs
  • Source code
  • Decorated with labels
  • Using compiler internal representation
  • Script
  • Describing the transformations to apply
  • Using source labels
  • Framework output
  • Transformed IR

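Transformed code (framework output):
    for (i = 0; i < M; i++)
      S1: Z[i] = 0;
    for (ii = 0; ii < (M + 31) / 32; ii++)
      for (jj = 0; jj < (M + 31) / 32; jj++)
        for (i = 32 * ii; i < min(M, 32 * ii + 32); i++)
          for (j = 32 * jj; j < min(M, 32 * jj + 32); j++) {
            S2: Z[i] += (A[i][j] + B[j][i]) * X[j];
            S3: Z[i] += A[i][j] * Y[j];
          }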
22
Tools using the Formalism
  • PolyDeps Dependency checker
  • URGenT Code Generator
  • Based on CLooG
  • Taking advantage of formalism invariants
  • Exponential reduction in the number of polyhedral
    computations
  • Reduced memory footprint

[Figure: Traditional approach: an analysis phase, then a check for applicability before each transformation phase. PolyDeps: dependency information is saved once, several transformation phases are applied, then dependency checking looks for broken dependencies.]
23
Conclusion and Future Work
  • Contributions
  • Program abstraction class: SCoPs (static control parts)
  • Good coverage of non-pointer-intensive codes
  • Formalism for both programs and program
    transformations
  • Component separation
  • Eases composition of program transformations
  • Implementation of the compilation framework
  • Definitions close to the formalism, composition
    oriented
  • Usable by non-expert users
  • swim (SPEC CFP2000) benchmark
  • More than 30 speedup compared to best compilers.

24
Conclusion and Future Work
  • Ongoing and future work
  • Optimize more SPEC benchmarks → automatically
  • Searching for transformation opportunities
  • Opportunities as transformations (code motion)
  • Engineering
  • WRaP-IT enhancements (modulo) → better SCoP
    coverage
  • Array generation → how to provide array memory
    mapping information
  • A language for URUK scripts → polyhedral
    meta-programming
  • Extending the notion of labels
  • Missing a way to target a range of source code
  • Toward instruction instances
  • Integration in an iterative compilation framework