Title: Automated Code Generation and High-Level Code Optimization in the Cactus Framework. Ian Hinder, Erik Schnetter, Gabrielle Allen, Steven R. Brandt, and Frank Löffler
Slide 1: Automated Code Generation and High-Level Code Optimization in the Cactus Framework
Ian Hinder, Erik Schnetter, Gabrielle Allen, Steven R. Brandt, and Frank Löffler
Slide 2: The Cactus Framework
- Component-based software engineering framework
- Metaphor:
  - Central infrastructure: the Flesh
  - Programmer-defined modules: Thorns
- General-purpose toolkit, but created to support General Relativity
- Originally created by Paul Walker and Joan Masso at the Albert Einstein Institute
Slide 3: The Cactus Framework
- Schedule tree
  - Sequential workflow with steps that run in parallel
  - Starts with a predefined set of steps for initialization, pre-evolve, evolve, post-evolve, etc.
- Groups
  - Programmer-defined steps that can be added to the schedule tree
Slide 4: The Cactus Framework
- Grid functions
  - Multi-dimensional arrays
  - Span multiple compute nodes
  - Synchronized through ghost zones and/or interpolation, managed by the driver layer
- Cactus configuration files (CCL)
  - Domain-specific configuration language that describes groups, grid functions, input parameters, etc.
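To make the CCL idea concrete, here is a minimal sketch of what such declarations look like; the thorn, group, and variable names are invented for this example and do not come from the talk:

```
# interface.ccl (sketch; names are hypothetical)
implements: wavetoy

CCTK_REAL scalar_fields type=GF timelevels=2
{
  phi, pi
} "Evolved scalar field and its time derivative"

# schedule.ccl (sketch)
schedule WaveToy_CalcRHS in MoL_CalcRHS
{
  LANG: C
} "Compute the right-hand side of the wave equation"
```

The driver layer reads these declarations to allocate the grid functions across nodes and to slot the routine into the schedule tree; the programmer never writes parallelization code.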
Slide 5: The Cactus Framework
- Supports C, C++, F77, F90
- Programmer does not have to worry about parallelism or mesh refinement; the abstraction enables parallel adaptive mesh refinement or unigrid code that scales to thousands of cores
- Checkpoint/restart to other architectures/core counts
- Simulation control, data analysis, and visualization support
- Active user community, 13 years old
- Enables collaborative development
Slide 6: Kranc
- A tool for generating thorns
- Turns Mathematica equations into code and Cactus configuration files
- Original version written by S. Husa, Ian Hinder, C. Lechner at AEI and Southampton
- Used for many scientific papers
- Allows the user to specify the continuum problem without concentrating on the numerical implementation
Slide 7
- Minimally, a Kranc script contains:
  - Cactus group definitions
  - A set of calculations
- A calculation is a Kranc data structure which describes how to update a set of grid functions from other grid functions
- A string identifies the schedule group
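A calculation of this kind might be written as follows; this is a sketch modeled on the SimpleWave example later in the talk, and the calculation name and key layout are illustrative rather than taken verbatim from Kranc:

```mathematica
(* Sketch of a Kranc calculation: it updates the grid functions
   phi and pi from other grid functions, and the string
   "in MoL_CalcRHS" identifies the schedule group. *)
evolCalc =
{
  Name      -> "WaveToy_CalcRHS",
  Schedule  -> {"in MoL_CalcRHS"},
  Equations ->
  {
    dot[phi] -> pi,
    dot[pi]  -> Euc[ui,uj] PD[phi, li, lj]
  }
};
```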
Slide 8: Kranc Features
- Tensorial input checked for index consistency
  - dot[h[la,lb]] -> -2 alpha K[la,lb]
- Calculations can be performed on Interior, Boundary, or Everywhere
- Allows simple boundary conditions to be coded
- Generated thorn can inherit from other thorns
- http://kranccode.org
Slide 9: A DSL for Kranc
- Mathematica also presents some difficulties...
- Would like to use more mathematical notation: G^a_bc instead of G[ua,lb,lc]
- Error messages from Mathematica are obscure and reproduce all of the Kranc input
- A small parsing expression grammar framework can help here
Slide 10: Input File for Wave Eqn

  @THORN SimpleWave
  @DERIVATIVES
    PDstandard2nd[i_]    -> StandardCenteredDifferenceOperator[1,1,i],
    PDstandard2nd[i_,i_] -> StandardCenteredDifferenceOperator[2,1,i],
    PDstandard2nd[i_,j_] -> StandardCenteredDifferenceOperator[1,1,i]
                            StandardCenteredDifferenceOperator[1,1,j]
  @END_DERIVATIVES
  @TENSORS
    phi, pi
  @END_TENSORS
  @GROUPS
    phi -> "phi_group",
    pi  -> "pi_group"
  @END_GROUPS
  @DEFINE PD PDstandard2nd
  ...
  @CALCULATION "initial_sine"
    @Schedule "AT INITIAL"
    @EQUATIONS
      phi -> Sin[2 Pi (x - t)],
      pi  -> -2 Pi Cos[2 Pi (x - t)]
    @END_EQUATIONS
  @END_CALCULATION
  @CALCULATION "calc_rhs"
    @Schedule "in MoL_CalcRHS"
    @EQUATIONS
      dot[phi] -> pi,
      dot[pi]  -> Euc[ui,uj] PD[phi,li,lj]
    @END_EQUATIONS
  @END_CALCULATION
  @END_THORN
Slide 11: The Einstein Equations
It looks so simple... it's just 10 coupled non-linear partial differential equations.
Slide 12: The Einstein Equations in Mathematica

  dir[ua] -> Sign[beta[ua]],
  epsdiss[ua] -> EpsDiss,
  detgt -> 1,
  gtu[ua,ub] -> 1/detgt detgtExpr MatrixInverse[gt[ua,ub]],
  Gtl[la,lb,lc] -> 1/2 (PD[gt[lb,la],lc] + PD[gt[lc,la],lb]
                        - PD[gt[lb,lc],la]),
  Gtlu[la,lb,uc] -> gtu[uc,ud] Gtl[la,lb,ld],
  Gt[ua,lb,lc] -> gtu[ua,ud] Gtl[ld,lb,lc],
  Xtn[ui] -> gtu[uj,uk] Gt[ui,lj,lk],
  Rtl[li,lj] -> - (1/2) gtu[ul,um] PD[gt[li,lj],ll,lm]
                + (1/2) gt[lk,li] PD[Xt[uk],lj]
                + (1/2) gt[lk,lj] PD[Xt[uk],li]
                + (1/2) Xtn[uk] Gtl[li,lj,lk]
                + (1/2) Xtn[uk] Gtl[lj,li,lk],
  gu[ua,ub] -> em4phi gtu[ua,ub],
  R[la,lb] -> Rt[la,lb] + Rphi[la,lb],
  (* Matter terms *)
  rho -> addMatter[1/alpha^2 (T00 - 2 beta[ui] T0[li]
                              + beta[ui] beta[uj] T[li,lj])],
  S[li] -> addMatter[-1/alpha (T0[li] - beta[uj] T[li,lj])],
  trS -> addMatter[em4phi gtu[ui,uj] T[li,lj]],
  (* RHS terms *)
  dot[phi] -> IfThen[conformalMethod, 1/3 phi, -1/6] alpha trK
              + Upwind[beta[ua], phi, la]
              + epsdiss[ua] PDdiss[phi,la]
              + IfThen[conformalMethod, -1/3 phi, 1/6] PD[beta[ua],la],
  dot[gt[la,lb]] -> - 2 alpha At[la,lb]
                    + Upwind[beta[uc], gt[la,lb], lc]
                    + epsdiss[uc] PDdiss[gt[la,lb],lc]
                    + gt[la,lc] PD[beta[uc],lb]
                    + gt[lb,lc] PD[beta[uc],la]
                    - (2/3) gt[la,lb] PD[beta[uc],lc],
  dotXt[ui] -> - 2 Atu[ui,uj] PD[alpha,lj]
               + 2 alpha (Gt[ui,lj,lk] Atu[uk,uj]
                          - (2/3) gtu[ui,uj] PD[trK,lj]
                          + 6 Atu[ui,uj] cdphi[lj])
               + gtu[uj,ul] PD[beta[ui],lj,ll]
               + (1/3) gtu[ui,uj] PD[beta[ul],lj,ll]
               + Upwind[beta[uj], Xt[ui], lj]
               + epsdiss[uj] PDdiss[Xt[ui],lj]
               - Xtn[uj] PD[beta[ui],lj]
               + (2/3) Xtn[ui] PD[beta[uj],lj]
               + addMatter[- 16 pi alpha gtu[ui,uj] S[lj]],
  dot[Xt[ui]] -> dotXt[ui],
  dottrK -> - em4phi (gtu[ua,ub] (PD[alpha,la,lb]
                                  + 2 cdphi[la] PD[alpha,lb])
                      - Xtn[ua] PD[alpha,la])
            + alpha (Atm[ua,lb] Atm[ub,la] + (1/3) trK^2)
            + Upwind[beta[ua], trK, la]
            + epsdiss[ua] PDdiss[trK,la]
            + addMatter[4 pi alpha (rho + trS)],
  dot[trK] -> dottrK,
  Ats[la,lb] -> - CDt[alpha,la,lb]
                + 2 (PD[alpha,la] cdphi[lb] + PD[alpha,lb] cdphi[la])
                + alpha R[la,lb],
  trAts -> gu[ua,ub] Ats[la,lb],
  dot[At[la,lb]] -> em4phi (Ats[la,lb] - (1/3) g[la,lb] trAts)
                    + alpha (trK At[la,lb] - 2 At[la,lc] Atm[uc,lb])
                    + Upwind[beta[uc], At[la,lb], lc]
                    + epsdiss[uc] PDdiss[At[la,lb],lc]
                    + At[la,lc] PD[beta[uc],lb]
                    + At[lb,lc] PD[beta[uc],la]
                    - (2/3) At[la,lb] PD[beta[uc],lc]
                    + addMatter[- em4phi alpha 8 pi
                                (T[la,lb] - (1/3) g[la,lb] trS)],
  dot[alpha] -> - harmonicF alpha^harmonicN (LapseACoeff A
                                             + (1 - LapseACoeff) trK)
                + LapseAdvectionCoeff Upwind[beta[ua], alpha, la]
                + epsdiss[ua] PDdiss[alpha,la],
  dot[A] -> LapseACoeff (dottrK - AlphaDriver A)
            + LapseAdvectionCoeff Upwind[beta[ua], A, la]
            + epsdiss[ua] PDdiss[A,la],
  eta -> etaExpr,
  theta -> thetaExpr,
  dot[beta[ua]] -> theta ShiftGammaCoeff (ShiftBCoeff B[ua]
                     + (1 - ShiftBCoeff) (Xt[ua] - eta BetaDriver beta[ua]))
                   + ShiftAdvectionCoeff Upwind[beta[ub], beta[ua], lb]
                   + epsdiss[ub] PDdiss[beta[ua],lb],
  dot[B[ua]] -> ShiftBCoeff (dotXt[ua] - eta BetaDriver B[ua])
                + ShiftAdvectionCoeff Upwind[beta[ub], B[ua], lb]
                - ShiftAdvectionCoeff Upwind[beta[ub], Xt[ua], lb]
                + epsdiss[ub] PDdiss[B[ua],lb]
Slide 13: The Einstein Equations in C

  CCTK_REAL rho = INV(SQR(alphaL))*(eTttL
    - 2*(beta2L*eTtyL + beta3L*eTtzL)
    + 2*(beta1L*(-eTtxL + beta2L*eTxyL + beta3L*eTxzL)
         + beta2L*beta3L*eTyzL)
    + eTxxL*SQR(beta1L) + eTyyL*SQR(beta2L) + eTzzL*SQR(beta3L));

  CCTK_REAL S1 = (-eTtxL + beta1L*eTxxL + beta2L*eTxyL
                  + beta3L*eTxzL)*INV(alphaL);

  CCTK_REAL S2 = (-eTtyL + beta1L*eTxyL + beta2L*eTyyL
                  + beta3L*eTyzL)*INV(alphaL);

  CCTK_REAL S3 = (-eTtzL + beta1L*eTxzL + beta2L*eTyzL
                  + beta3L*eTzzL)*INV(alphaL);

  CCTK_REAL trS = em4phi*(eTxxL*gtu11 + eTyyL*gtu22
    + 2*(eTxyL*gtu12 + eTxzL*gtu13 + eTyzL*gtu23) + eTzzL*gtu33);

  CCTK_REAL phirhsL = epsdiss1*PDdissipationNth1phi
    + epsdiss2*PDdissipationNth2phi
    + epsdiss3*PDdissipationNth3phi;
Slide 14: The Einstein Equations
- It's one kernel
  - Split in two parts to avoid instruction cache misses
- Hard for compilers to optimize/vectorize
  - Examined assembly output
  - No fission, blocking, unrolling, vectorization, ...
  - No useful compiler reports either
Slide 15: Gµν: Slow!
- All modern CPUs (Intel, AMD, IBM Blue Gene, IBM Power, even GPUs) execute floating point instructions in vector units
- Typically operating on 2 or 4 values simultaneously (GPUs: up to 32 values, a warp)
- Scalar code just throws away 50% or more of TPP (theoretical peak performance)
Slide 16: Alternative Approach: Use Kranc
- We already use Automated Code Generation (ACG) to generate our loop kernels
  - Actually, to generate complete Cactus components
- Many advantages: reduced code size, ensured correctness, easier experimentation and modification
- Can also apply high-level and low-level optimizations that compilers do not know about
  - General or domain-specific optimizations
- Cannot always wait for optimization-related bug fixes to compilers, and adding a stage to the ACG framework is not too complex
Slide 17: Vectorization Design
- High-level architecture-independent API for vectorized code
- Provides implementations for various architectures
  - Intel/AMD SSE2, Intel/AMD AVX, Blue Gene Double Hummer, Power 7 VSX (as well as trivial scalar pseudo-vectorization)
- Vectorized code needs to be modified for:
  - Data types, load/store operations, arithmetic operations, loop strides
- API imposes restrictions: not all scalar code can be easily vectorized
- Memory alignment may require array padding for higher performance
Slide 18: Vectorization Implementation
- First attempt used C++, templates, and inline functions to ensure readable code
  - Failure: very slow when debugging, many compiler segfaults (GCC and Intel are the most stable compilers)
- Now using macros defined in LSUThorns/Vectors (part of the Einstein Toolkit release called "Curie")
  - Does not look as nice, e.g. add(x, mul(y, z))
- Targeting mostly stencil-based loop kernels
  - All Kranc codes can benefit immediately
- Capabilities of different architectures are surprisingly similar
  - Exceptions: unaligned memory access, partial vector load/store
Slide 19: Vectorization Implementation
- Break up code into vectors: w[j] = u[j] + v[j] becomes {w[j], w[j+1]} = {u[j], u[j+1]} + {v[j], v[j+1]}
- Mathematica transformation rules are applied, e.g. xx_ yy_ -> kmul[xx,yy]
Slide 20: Vectorization API Functionality
- Aligned load, unaligned load
  - Compile-time logic to avoid unnecessary unaligned loads when accessing multi-D array elements
- Aligned store, partial vector store, store while bypassing cache
- Load/store API not used symmetrically in applications!
- Using partial vector operations at loop boundaries (instead of scalar operations) to avoid fixup code
- Arithmetic operations, including abs/max/min/sqrt/etc.
- Also including fma (fused multiply-add) operations explicitly, since compilers don't seem to automatically generate fma instructions for vectors...
Slide 21: Performance Numbers

  System     CPU      Compiler  Speedup
  Bethe      Intel    Intel     1.07
  Blue Drop  Power 7  IBM       NDA
  Curry      Intel    GCC       2.20 (?)
  Hopper     AMD      GCC       1.47
  Kraken     AMD      Intel     2.17 (?)
  Ranger     AMD      Intel     1.41

- Performance improvements depend on many details, in particular (for our code) i-cache misses
- Theoretical improvement factor: 2
Slide 22: Automated Code Generation with Cactus/Kranc
- Provides a manageable high-level interface to equations
- Improves programmability
- Provides opportunities for specialized optimizations
- Reduces boilerplate