Automated Code Generation and High-Level Code Optimization in the Cactus Framework
Ian Hinder, Erik Schnetter, Gabrielle Allen, Steven R. Brandt, and Frank Löffler

Transcript and Presenter's Notes

1
Automated Code Generation and High-Level Code Optimization in the Cactus Framework
Ian Hinder, Erik Schnetter, Gabrielle Allen, Steven R. Brandt, and Frank Löffler
2
The Cactus Framework
  • Component-Based Software Engineering Framework
  • Metaphor:
  • Central infrastructure: the Flesh
  • Programmer-defined modules: Thorns
  • General-purpose toolkit, but created to support
    General Relativity
  • Originally created by Paul Walker and Joan Masso
    at the Albert Einstein Institute

3
The Cactus Framework
  • Schedule Tree:
  • Sequential workflow with steps that run in
    parallel
  • Starts with a predefined set of steps for
    initialization, pre-evolve, evolve, post-evolve,
    etc.
  • Groups:
  • Programmer-defined steps that can be added to the
    schedule tree

4
The Cactus Framework
  • Grid Functions:
  • Multi-dimensional arrays
  • Span multiple compute nodes
  • Synchronized through ghost zones and/or
    interpolation, managed by the driver layer
  • Cactus Configuration Language (CCL) files:
  • Domain-specific configuration language that
    describes groups, grid functions, input
    parameters, etc.
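As an illustration, a CCL declaration of a grid function group might look like the following sketch (the thorn and group names here are hypothetical, not taken from the presentation):

```
# interface.ccl for a hypothetical thorn
implements: wavetoy
inherits: grid

CCTK_REAL scalar_group TYPE=GF TIMELEVELS=3
{
  phi, pi
} "Evolved scalar field and its time derivative"
```

The driver layer reads declarations like this to allocate, distribute, and synchronize the grid functions across compute nodes.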

5
The Cactus Framework
  • Supports C, C++, F77, F90
  • Programmer does not have to worry about
    parallelism or mesh refinement; the abstraction
    enables parallel adaptive mesh refinement or
    unigrid code that scales to thousands of cores
  • Checkpoint/restart to other architectures/core
    counts
  • Simulation control, data analysis, and
    visualization support
  • Active user community; 13 years old
  • Enables collaborative development

6
  • A tool for generating thorns
  • Turns Mathematica equations into code and Cactus
    configuration files
  • Original version written by S. Husa, Ian Hinder,
    and C. Lechner at the AEI and Southampton
  • Used for many scientific papers
  • Allows the user to specify the continuum problem
    without concentrating on the numerical
    implementation

7
  • Minimally, a Kranc script contains:
  • Cactus group definitions
  • A set of Calculations
  • A Calculation is a Kranc data structure that
    describes how to update a set of grid functions
    from other grid functions.
  • A string identifies the schedule group.

8
Kranc Features
  • Tensorial input checked for index consistency:
  • dot[h[la,lb]] -> -2 alpha K[la,lb]
  • Calculations can be performed on the Interior,
    Boundary, or Everywhere.
  • Allows simple boundary conditions to be coded.
  • Generated thorns can inherit from other thorns
  • http://kranccode.org

9
DSL For Kranc
  • Mathematica also presents some difficulties...
  • Would like to use more mathematical notation:
    G^a_bc instead of G[ua,lb,lc]
  • Error messages from Mathematica are obscure and
    reproduce all of the Kranc input
  • A small parsing expression grammar framework can
    help here

10
Input File for Wave Eqn
  • @THORN SimpleWave
  • @DERIVATIVES
  •   PDstandard2nd[i_] ->
        StandardCenteredDifferenceOperator[1,1,i],
  •   PDstandard2nd[i_, i_] ->
        StandardCenteredDifferenceOperator[2,1,i],
  •   PDstandard2nd[i_, j_] ->
        StandardCenteredDifferenceOperator[1,1,i]
        StandardCenteredDifferenceOperator[1,1,j]
  • @END_DERIVATIVES
  • @TENSORS
  •   phi, pi
  • @END_TENSORS
  • @GROUPS
  •   phi -> "phi_group",
  •   pi -> "pi_group"
  • @END_GROUPS
  • @DEFINE PD PDstandard2nd
  • ...

... @CALCULATION "initial_sine"
      @Schedule "AT INITIAL"
      @EQUATIONS
        phi -> Sin[2 Pi (x - t)],
        pi -> -2 Pi Cos[2 Pi (x - t)]
      @END_EQUATIONS
    @END_CALCULATION
    @CALCULATION "calc_rhs"
      @Schedule "in MoL_CalcRHS"
      @EQUATIONS
        dot[phi] -> pi,
        dot[pi] -> Euc[ui,uj] PD[phi,li,lj]
      @END_EQUATIONS
    @END_CALCULATION
    @END_THORN
11
The Einstein Equations
  • Gμν = 8π Tμν

It looks so simple... it's just 10 coupled
non-linear partial differential equations
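Written out, with the Einstein tensor expanded in terms of the Ricci tensor and scalar (geometrized units, G = c = 1):

```latex
G_{\mu\nu} \;\equiv\; R_{\mu\nu} - \tfrac{1}{2}\,R\,g_{\mu\nu} \;=\; 8\pi\,T_{\mu\nu}
```

Both sides are symmetric 4x4 tensors, which is where the 10 independent components come from; each component is a non-linear second-order PDE in the metric g.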
12
The Einstein Eqns in Mathematica
  dir[ua] -> Sign[beta[ua]],
  epsdiss[ua] -> EpsDiss,
  detgt -> 1,
  gtu[ua,ub] -> 1/detgt detgtExpr MatrixInverse[gt[ua,ub]],
  Gtl[la,lb,lc] -> 1/2
    (PD[gt[lb,la],lc] + PD[gt[lc,la],lb] - PD[gt[lb,lc],la]),
  Gtlu[la,lb,uc] -> gtu[uc,ud] Gtl[la,lb,ld],
  Gt[ua,lb,lc] -> gtu[ua,ud] Gtl[ld,lb,lc],
  Xtn[ui] -> gtu[uj,uk] Gt[ui,lj,lk],
  Rt[li,lj] -> - (1/2) gtu[ul,um] PD[gt[li,lj],ll,lm]
    + (1/2) gt[lk,li] PD[Xt[uk],lj]
    + (1/2) gt[lk,lj] PD[Xt[uk],li]
    + (1/2) Xtn[uk] Gtl[li,lj,lk]
    + (1/2) Xtn[uk] Gtl[lj,li,lk],
  gu[ua,ub] -> em4phi gtu[ua,ub],
  R[la,lb] -> Rt[la,lb] + Rphi[la,lb],
  (* Matter terms *)
  rho -> addMatter
    (1/alpha^2 (T00 - 2 beta[ui] T0[li]
      + beta[ui] beta[uj] T[li,lj])),
  S[li] -> addMatter
    (-1/alpha (T0[li] - beta[uj] T[li,lj])),
  trS -> addMatter (em4phi gtu[ui,uj] T[li,lj]),
  (* RHS terms *)
  dot[phi] -> IfThen[conformalMethod, 1/3 phi, -1/6] alpha trK
    + Upwind[beta[ua], phi, la]
    + epsdiss[ua] PDdiss[phi,la]
    + IfThen[conformalMethod, -1/3 phi, 1/6] PD[beta[ua],la],
  dot[gt[la,lb]] -> - 2 alpha At[la,lb]
    + Upwind[beta[uc], gt[la,lb], lc]
    + epsdiss[uc] PDdiss[gt[la,lb],lc]
    + gt[la,lc] PD[beta[uc],lb]
    + gt[lb,lc] PD[beta[uc],la]
    - (2/3) gt[la,lb] PD[beta[uc],lc],
  dotXt[ui] -> - 2 Atu[ui,uj] PD[alpha,lj]
    + 2 alpha (Gt[ui,lj,lk] Atu[uk,uj]
      - (2/3) gtu[ui,uj] PD[trK,lj]
      + 6 Atu[ui,uj] cdphi[lj])
    + gtu[uj,ul] PD[beta[ui],lj,ll]
    + (1/3) gtu[ui,uj] PD[beta[ul],lj,ll]
    + Upwind[beta[uj], Xt[ui], lj]
    + epsdiss[uj] PDdiss[Xt[ui],lj]
    - Xtn[uj] PD[beta[ui],lj]
    + (2/3) Xtn[ui] PD[beta[uj],lj]
    + addMatter (- 16 pi alpha gtu[ui,uj] S[lj]),
  dot[Xt[ui]] -> dotXt[ui],
  dottrK -> - em4phi
    ( gtu[ua,ub] ( PD[alpha,la,lb]
        + 2 cdphi[la] PD[alpha,lb] )
      - Xtn[ua] PD[alpha,la] )
    + alpha (Atm[ua,lb] Atm[ub,la] + (1/3) trK^2)
    + Upwind[beta[ua], trK, la]
    + epsdiss[ua] PDdiss[trK,la]
    + addMatter (4 pi alpha (rho + trS)),
  dot[trK] -> dottrK,
  Ats[la,lb] -> - CDt[alpha,la,lb]
    + 2 (PD[alpha,la] cdphi[lb] + PD[alpha,lb] cdphi[la])
    + alpha R[la,lb],
  trAts -> gu[ua,ub] Ats[la,lb],
  dot[At[la,lb]] -> em4phi ( Ats[la,lb] - (1/3) g[la,lb] trAts )
    + alpha (trK At[la,lb] - 2 At[la,lc] Atm[uc,lb])
    + Upwind[beta[uc], At[la,lb], lc]
    + epsdiss[uc] PDdiss[At[la,lb],lc]
    + At[la,lc] PD[beta[uc],lb]
    + At[lb,lc] PD[beta[uc],la]
    - (2/3) At[la,lb] PD[beta[uc],lc]
    + addMatter (- em4phi alpha 8 pi
        (T[la,lb] - (1/3) g[la,lb] trS)),
  dot[alpha] -> - harmonicF alpha^harmonicN
    ( LapseACoeff A + (1 - LapseACoeff) trK)
    + LapseAdvectionCoeff Upwind[beta[ua], alpha, la]
    + epsdiss[ua] PDdiss[alpha,la],
  dot[A] -> LapseACoeff (dottrK - AlphaDriver A)
    + LapseAdvectionCoeff Upwind[beta[ua], A, la]
    + epsdiss[ua] PDdiss[A,la],
  eta -> etaExpr,
  theta -> thetaExpr,
  dot[beta[ua]] -> theta ShiftGammaCoeff
    ( ShiftBCoeff B[ua]
      + (1 - ShiftBCoeff) (Xt[ua] - eta BetaDriver beta[ua]))
    + ShiftAdvectionCoeff Upwind[beta[ub], beta[ua], lb]
    + epsdiss[ub] PDdiss[beta[ua],lb],
  dot[B[ua]] -> ShiftBCoeff (dotXt[ua] - eta BetaDriver B[ua])
    + ShiftAdvectionCoeff Upwind[beta[ub], B[ua], lb]
    - ShiftAdvectionCoeff Upwind[beta[ub], Xt[ua], lb]
    + epsdiss[ub] PDdiss[B[ua],lb]
13
The Einstein Equations in C
  CCTK_REAL rho = INV(SQR(alphaL))*(eTttL -
    2*(beta2L*eTtyL + beta3L*eTtzL) +
    2*(beta1L*(-eTtxL + beta2L*eTxyL + beta3L*eTxzL) +
    beta2L*beta3L*eTyzL) + eTxxL*SQR(beta1L) +
    eTyyL*SQR(beta2L) + eTzzL*SQR(beta3L));
  CCTK_REAL S1 = (-eTtxL + beta1L*eTxxL +
    beta2L*eTxyL + beta3L*eTxzL)*INV(alphaL);
  CCTK_REAL S2 = (-eTtyL + beta1L*eTxyL +
    beta2L*eTyyL + beta3L*eTyzL)*INV(alphaL);
  CCTK_REAL S3 = (-eTtzL + beta1L*eTxzL +
    beta2L*eTyzL + beta3L*eTzzL)*INV(alphaL);
  CCTK_REAL trS = em4phi*(eTxxL*gtu11 +
    eTyyL*gtu22 + 2*(eTxyL*gtu12 +
    eTxzL*gtu13 + eTyzL*gtu23) + eTzzL*gtu33);
  CCTK_REAL phirhsL = epsdiss1*PDdissipationNth1phi +
    epsdiss2*PDdissipationNth2phi +
    epsdiss3*PDdissipationNth3phi;

14
The Einstein Equations
  • It's one kernel
  • Split in two parts to avoid instruction cache
    misses
  • Hard for compilers to optimize/vectorize
  • Examined assembly output
  • No fission, blocking, unrolling, vectorization,
    ...
  • No useful compiler reports either

15
Gμν: Slow!
  • All modern CPUs (Intel, AMD, IBM Blue Gene, IBM
    Power, even GPUs) execute floating point
    instructions in vector units
  • Typically operating on 2 or 4 values
    simultaneously (GPUs up to 32 values, a warp)
  • Scalar code just throws away 50% or more of TPP
    (theoretical peak performance)

16
Alternative Approach: Use Kranc
  • We already use Automated Code Generation (ACG) to
    generate our loop kernels
  • Actually, to generate complete Cactus components
  • Many advantages: reduced code size, ensured
    correctness, easier experimentation and
    modification, ...
  • Can also apply high-level and low-level
    optimizations that compilers do not know about
  • General or domain-specific optimizations
  • Cannot always wait for optimization-related bug
    fixes to compilers, and adding a stage to the ACG
    framework is not too complex

17
Vectorization Design
  • High-level architecture-independent API for
    vectorized code
  • Provides implementations for various
    architectures
  • Intel/AMD SSE2, Intel/AMD AVX, Blue Gene Double
    Hummer, Power 7 VSX (as well as trivial scalar
    pseudo-vectorization)
  • Vectorized code needs to be modified for
  • Data types, load/store operations, arithmetic
    operations, loop strides
  • API imposes restrictions; not all scalar code can
    be easily vectorized
  • Memory alignment may require array padding for
    higher performance

18
Vectorization Implementation
  • First attempt used C++, templates, and inline
    functions to ensure readable code
  • Failure: very slow when debugging, many compiler
    segfaults (GCC and Intel are the most stable
    compilers)
  • Now using macros defined in LSUThorns/Vectors
    (part of the Einstein Toolkit release called
    "Curie")
  • Does not look as nice, e.g. add(x, mul(y, z))
  • Targeting mostly stencil-based loop kernels
  • All Kranc codes can benefit immediately
  • Capabilities of different architectures are
    surprisingly similar
  • Exceptions: unaligned memory access, partial
    vector load/store

19
Vectorization Implementation
  • Break up code into vectors: w[j] = u[j] + v[j]
    becomes {w[j], w[j+1]} = {u[j], u[j+1]} +
    {v[j], v[j+1]}
  • Mathematica transformation rules are applied,
    e.g. xx_ yy_ -> kmul[xx,yy]

20
Vectorization API Functionality
  • Aligned load, unaligned load
  • Compile-time logic to avoid unnecessary unaligned
    loads when accessing multi-D array elements
  • Aligned store, partial vector store, store while
    bypassing cache
  • Load/store API not used symmetrically in
    applications!
  • Using partial vector operations at loop
    boundaries (instead of scalar operations) to
    avoid fixup code
  • Arithmetic operations, including
    abs/max/min/sqrt/etc.
  • Also including fma (fused multiply-add)
    operations explicitly, since compilers don't seem
    to automatically generate fma instructions for
    vectors...

21
Performance Numbers
  System     CPU      Compiler  Speedup
  Bethe      Intel    Intel     1.07
  Blue Drop  Power 7  IBM       NDA
  Curry      Intel    GCC       2.20 (?)
  Hopper     AMD      GCC       1.47
  Kraken     AMD      Intel     2.17 (?)
  Ranger     AMD      Intel     1.41
  • Performance improvements depend on many details,
    in particular (for our code) i-cache misses
  • Theoretical improvement: factor 2

22
Automated CodeGen with Cactus/Kranc
  • Provides manageable high-level interface to
    equations
  • Improves programmability
  • Provides opportunities for specialized
    optimizations
  • Reduces boilerplate