Title: Automated Code Generation and High-Level Code Optimization in the Cactus Framework. Ian Hinder, Erik Schnetter, Gabrielle Allen, Steven R. Brandt, and Frank Löffler
Slide 1: Automated Code Generation and High-Level Code Optimization in the Cactus Framework
Ian Hinder, Erik Schnetter, Gabrielle Allen, Steven R. Brandt, and Frank Löffler
Slide 2: The Cactus Framework
- Component-based software engineering framework
- Metaphor:
  - Central infrastructure: the Flesh
  - Programmer-defined modules: Thorns
- General-purpose toolkit, but created to support General Relativity
- Originally created by Paul Walker and Joan Masso at the Albert Einstein Institute
Slide 3: The Cactus Framework
- Schedule tree
  - Sequential workflow with steps that run in parallel
  - Starts with a predefined set of steps for initialization, pre-evolve, evolve, post-evolve, etc.
- Groups
  - Programmer-defined steps that can be added to the schedule tree
Slide 4: The Cactus Framework
- Grid functions
  - Multi-dimensional arrays
  - Span multiple compute nodes
  - Synchronized through ghost zones and/or interpolation, managed by the driver layer
- Cactus configuration files (CCL)
  - Domain-specific configuration language that describes groups, grid functions, input parameters, etc.
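To make the CCL idea concrete, here is a minimal sketch of what such declarations look like; the thorn, group, and variable names are invented for this example and do not come from the talk:

```
# interface.ccl (sketch; names are hypothetical)
implements: wavetoy

CCTK_REAL scalar_fields type=GF timelevels=2
{
  phi, pi
} "Evolved scalar field and its time derivative"

# schedule.ccl (sketch)
schedule WaveToy_CalcRHS in MoL_CalcRHS
{
  LANG: C
} "Compute the right-hand side of the wave equation"
```

The driver layer reads these declarations to allocate the grid functions across nodes and to slot the routine into the schedule tree; the programmer never writes parallelization code.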
Slide 5: The Cactus Framework
- Supports C, C++, F77, F90
- Programmer does not have to worry about parallelism or mesh refinement; the abstraction enables parallel adaptive mesh refinement or unigrid code that scales to thousands of cores
- Checkpoint/restart to other architectures/core counts
- Simulation control, data analysis, and visualization support
- Active user community, 13 years old
- Enables collaborative development
Slide 6: Kranc
- A tool for generating thorns
- Turns Mathematica equations into code and Cactus configuration files
- Original version written by S. Husa, Ian Hinder, C. Lechner at AEI and Southampton
- Used for many scientific papers
- Allows the user to specify the continuum problem without concentrating on the numerical implementation
Slide 7
- Minimally, a Kranc script contains:
  - Cactus group definitions
  - A set of calculations
- A calculation is a Kranc data structure which describes how to update a set of grid functions from other grid functions
- A string identifies the schedule group
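A calculation of this kind might be written as follows; this is a sketch modeled on the SimpleWave example later in the talk, and the calculation name and key layout are illustrative rather than taken verbatim from Kranc:

```mathematica
(* Sketch of a Kranc calculation: it updates the grid functions
   phi and pi from other grid functions, and the string
   "in MoL_CalcRHS" identifies the schedule group. *)
evolCalc =
{
  Name      -> "WaveToy_CalcRHS",
  Schedule  -> {"in MoL_CalcRHS"},
  Equations ->
  {
    dot[phi] -> pi,
    dot[pi]  -> Euc[ui,uj] PD[phi, li, lj]
  }
};
```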
Slide 8: Kranc Features
- Tensorial input checked for index consistency
  - dot[h[la,lb]] -> -2 alpha K[la,lb]
- Calculations can be performed on Interior, Boundary, or Everywhere
- Allows simple boundary conditions to be coded
- Generated thorn can inherit from other thorns
- http://kranccode.org
Slide 9: A DSL for Kranc
- Mathematica also presents some difficulties...
- Would like to use more mathematical notation: G^a_bc instead of G[ua,lb,lc]
- Error messages from Mathematica are obscure and reproduce all of the Kranc input
- A small parsing expression grammar framework can help here
Slide 10: Input File for Wave Eqn

  @THORN SimpleWave
  @DERIVATIVES
    PDstandard2nd[i_]    -> StandardCenteredDifferenceOperator[1,1,i],
    PDstandard2nd[i_,i_] -> StandardCenteredDifferenceOperator[2,1,i],
    PDstandard2nd[i_,j_] -> StandardCenteredDifferenceOperator[1,1,i]
                            StandardCenteredDifferenceOperator[1,1,j]
  @END_DERIVATIVES
  @TENSORS
    phi, pi
  @END_TENSORS
  @GROUPS
    phi -> "phi_group",
    pi  -> "pi_group"
  @END_GROUPS
  @DEFINE PD PDstandard2nd
  ...
  @CALCULATION "initial_sine"
    @Schedule "AT INITIAL"
    @EQUATIONS
      phi -> Sin[2 Pi (x - t)],
      pi  -> -2 Pi Cos[2 Pi (x - t)]
    @END_EQUATIONS
  @END_CALCULATION
  @CALCULATION "calc_rhs"
    @Schedule "in MoL_CalcRHS"
    @EQUATIONS
      dot[phi] -> pi,
      dot[pi]  -> Euc[ui,uj] PD[phi,li,lj]
    @END_EQUATIONS
  @END_CALCULATION
  @END_THORN
Slide 11: The Einstein Equations
It looks so simple... it's just 10 coupled non-linear partial differential equations.
Slide 12: The Einstein Equations in Mathematica

  dir[ua] -> Sign[beta[ua]],
  epsdiss[ua] -> EpsDiss,
  detgt -> 1,
  gtu[ua,ub] -> 1/detgt detgtExpr MatrixInverse[gt[ua,ub]],
  Gtl[la,lb,lc] -> 1/2 (PD[gt[lb,la],lc] + PD[gt[lc,la],lb]
                        - PD[gt[lb,lc],la]),
  Gtlu[la,lb,uc] -> gtu[uc,ud] Gtl[la,lb,ld],
  Gt[ua,lb,lc] -> gtu[ua,ud] Gtl[ld,lb,lc],
  Xtn[ui] -> gtu[uj,uk] Gt[ui,lj,lk],
  Rtl[li,lj] -> - (1/2) gtu[ul,um] PD[gt[li,lj],ll,lm]
                + (1/2) gt[lk,li] PD[Xt[uk],lj]
                + (1/2) gt[lk,lj] PD[Xt[uk],li]
                + (1/2) Xtn[uk] Gtl[li,lj,lk]
                + (1/2) Xtn[uk] Gtl[lj,li,lk],
  gu[ua,ub] -> em4phi gtu[ua,ub],
  R[la,lb] -> Rt[la,lb] + Rphi[la,lb],
  (* Matter terms *)
  rho -> addMatter[1/alpha^2 (T00 - 2 beta[ui] T0[li]
                              + beta[ui] beta[uj] T[li,lj])],
  S[li] -> addMatter[-1/alpha (T0[li] - beta[uj] T[li,lj])],
  trS -> addMatter[em4phi gtu[ui,uj] T[li,lj]],
  (* RHS terms *)
  dot[phi] -> IfThen[conformalMethod, 1/3 phi, -1/6] alpha trK
              + Upwind[beta[ua], phi, la]
              + epsdiss[ua] PDdiss[phi,la]
              + IfThen[conformalMethod, -1/3 phi, 1/6] PD[beta[ua],la],
  dot[gt[la,lb]] -> - 2 alpha At[la,lb]
                    + Upwind[beta[uc], gt[la,lb], lc]
                    + epsdiss[uc] PDdiss[gt[la,lb],lc]
                    + gt[la,lc] PD[beta[uc],lb]
                    + gt[lb,lc] PD[beta[uc],la]
                    - (2/3) gt[la,lb] PD[beta[uc],lc],
  dotXt[ui] -> - 2 Atu[ui,uj] PD[alpha,lj]
               + 2 alpha (Gt[ui,lj,lk] Atu[uk,uj]
                          - (2/3) gtu[ui,uj] PD[trK,lj]
                          + 6 Atu[ui,uj] cdphi[lj])
               + gtu[uj,ul] PD[beta[ui],lj,ll]
               + (1/3) gtu[ui,uj] PD[beta[ul],lj,ll]
               + Upwind[beta[uj], Xt[ui], lj]
               + epsdiss[uj] PDdiss[Xt[ui],lj]
               - Xtn[uj] PD[beta[ui],lj]
               + (2/3) Xtn[ui] PD[beta[uj],lj]
               + addMatter[- 16 pi alpha gtu[ui,uj] S[lj]],
  dot[Xt[ui]] -> dotXt[ui],
  dottrK -> - em4phi (gtu[ua,ub] (PD[alpha,la,lb]
                                  + 2 cdphi[la] PD[alpha,lb])
                      - Xtn[ua] PD[alpha,la])
            + alpha (Atm[ua,lb] Atm[ub,la] + (1/3) trK^2)
            + Upwind[beta[ua], trK, la]
            + epsdiss[ua] PDdiss[trK,la]
            + addMatter[4 pi alpha (rho + trS)],
  dot[trK] -> dottrK,
  Ats[la,lb] -> - CDt[alpha,la,lb]
                + 2 (PD[alpha,la] cdphi[lb] + PD[alpha,lb] cdphi[la])
                + alpha R[la,lb],
  trAts -> gu[ua,ub] Ats[la,lb],
  dot[At[la,lb]] -> em4phi (Ats[la,lb] - (1/3) g[la,lb] trAts)
                    + alpha (trK At[la,lb] - 2 At[la,lc] Atm[uc,lb])
                    + Upwind[beta[uc], At[la,lb], lc]
                    + epsdiss[uc] PDdiss[At[la,lb],lc]
                    + At[la,lc] PD[beta[uc],lb]
                    + At[lb,lc] PD[beta[uc],la]
                    - (2/3) At[la,lb] PD[beta[uc],lc]
                    + addMatter[- em4phi alpha 8 pi
                                (T[la,lb] - (1/3) g[la,lb] trS)],
  dot[alpha] -> - harmonicF alpha^harmonicN (LapseACoeff A
                                             + (1 - LapseACoeff) trK)
                + LapseAdvectionCoeff Upwind[beta[ua], alpha, la]
                + epsdiss[ua] PDdiss[alpha,la],
  dot[A] -> LapseACoeff (dottrK - AlphaDriver A)
            + LapseAdvectionCoeff Upwind[beta[ua], A, la]
            + epsdiss[ua] PDdiss[A,la],
  eta -> etaExpr,
  theta -> thetaExpr,
  dot[beta[ua]] -> theta ShiftGammaCoeff (ShiftBCoeff B[ua]
                     + (1 - ShiftBCoeff) (Xt[ua] - eta BetaDriver beta[ua]))
                   + ShiftAdvectionCoeff Upwind[beta[ub], beta[ua], lb]
                   + epsdiss[ub] PDdiss[beta[ua],lb],
  dot[B[ua]] -> ShiftBCoeff (dotXt[ua] - eta BetaDriver B[ua])
                + ShiftAdvectionCoeff Upwind[beta[ub], B[ua], lb]
                - ShiftAdvectionCoeff Upwind[beta[ub], Xt[ua], lb]
                + epsdiss[ub] PDdiss[B[ua],lb]
Slide 13: The Einstein Equations in C

  CCTK_REAL rho = INV(SQR(alphaL))*(eTttL
    - 2*(beta2L*eTtyL + beta3L*eTtzL)
    + 2*(beta1L*(-eTtxL + beta2L*eTxyL + beta3L*eTxzL)
         + beta2L*beta3L*eTyzL)
    + eTxxL*SQR(beta1L) + eTyyL*SQR(beta2L) + eTzzL*SQR(beta3L));

  CCTK_REAL S1 = (-eTtxL + beta1L*eTxxL + beta2L*eTxyL
                  + beta3L*eTxzL)*INV(alphaL);

  CCTK_REAL S2 = (-eTtyL + beta1L*eTxyL + beta2L*eTyyL
                  + beta3L*eTyzL)*INV(alphaL);

  CCTK_REAL S3 = (-eTtzL + beta1L*eTxzL + beta2L*eTyzL
                  + beta3L*eTzzL)*INV(alphaL);

  CCTK_REAL trS = em4phi*(eTxxL*gtu11 + eTyyL*gtu22
    + 2*(eTxyL*gtu12 + eTxzL*gtu13 + eTyzL*gtu23) + eTzzL*gtu33);

  CCTK_REAL phirhsL = epsdiss1*PDdissipationNth1phi
    + epsdiss2*PDdissipationNth2phi
    + epsdiss3*PDdissipationNth3phi;
Slide 14: The Einstein Equations
- It's one kernel
  - Split in two parts to avoid instruction cache misses
- Hard for compilers to optimize/vectorize
  - Examined assembly output
  - No fission, blocking, unrolling, vectorization, ...
  - No useful compiler reports either
Slide 15: Gµν: Slow!
- All modern CPUs (Intel, AMD, IBM Blue Gene, IBM Power, even GPUs) execute floating point instructions in vector units
- Typically operating on 2 or 4 values simultaneously (GPUs: up to 32 values, a warp)
- Scalar code just throws away 50% or more of TPP (theoretical peak performance)
Slide 16: Alternative Approach: Use Kranc
- We already use Automated Code Generation (ACG) to generate our loop kernels
  - Actually, to generate complete Cactus components
- Many advantages: reduced code size, ensured correctness, easier experimentation and modification
- Can also apply high-level and low-level optimizations that compilers do not know about
  - General or domain-specific optimizations
- Cannot always wait for optimization-related bug fixes to compilers, and adding a stage to the ACG framework is not too complex
Slide 17: Vectorization Design
- High-level architecture-independent API for vectorized code
- Provides implementations for various architectures
  - Intel/AMD SSE2, Intel/AMD AVX, Blue Gene Double Hummer, Power 7 VSX (as well as trivial scalar pseudo-vectorization)
- Vectorized code needs to be modified for:
  - Data types, load/store operations, arithmetic operations, loop strides
- API imposes restrictions: not all scalar code can be easily vectorized
- Memory alignment may require array padding for higher performance
Slide 18: Vectorization Implementation
- First attempt used C++, templates, and inline functions to ensure readable code
  - Failure: very slow when debugging, many compiler segfaults (GCC and Intel are the most stable compilers)
- Now using macros defined in LSUThorns/Vectors (part of the Einstein Toolkit release called "Curie")
  - Does not look as nice, e.g. add(x, mul(y, z))
- Targeting mostly stencil-based loop kernels
  - All Kranc codes can benefit immediately
- Capabilities of different architectures are surprisingly similar
  - Exceptions: unaligned memory access, partial vector load/store
Slide 19: Vectorization Implementation
- Break up code into vectors: w[j] = u[j] + v[j] becomes {w[j], w[j+1]} = {u[j], u[j+1]} + {v[j], v[j+1]}
- Mathematica transformation rules are applied, e.g. xx_ yy_ -> kmul[xx,yy]
Slide 20: Vectorization API Functionality
- Aligned load, unaligned load
  - Compile-time logic to avoid unnecessary unaligned loads when accessing multi-D array elements
- Aligned store, partial vector store, store while bypassing cache
- Load/store API not used symmetrically in applications!
- Using partial vector operations at loop boundaries (instead of scalar operations) to avoid fixup code
- Arithmetic operations, including abs/max/min/sqrt/etc.
- Also including fma (fused multiply-add) operations explicitly, since compilers don't seem to automatically generate fma instructions for vectors...
Slide 21: Performance Numbers

  System     CPU      Compiler  Speedup
  Bethe      Intel    Intel     1.07
  Blue Drop  Power 7  IBM       NDA
  Curry      Intel    GCC       2.20 (?)
  Hopper     AMD      GCC       1.47
  Kraken     AMD      Intel     2.17 (?)
  Ranger     AMD      Intel     1.41

- Performance improvements depend on many details, in particular (for our code) i-cache misses
- Theoretical improvement factor: 2
Slide 22: Automated Code Generation with Cactus/Kranc
- Provides a manageable high-level interface to equations
- Improves programmability
- Provides opportunities for specialized optimizations
- Reduces boilerplate