The Synergy between Computer Science and CFD Modeling - PowerPoint PPT Presentation

1 / 37
About This Presentation
Title:

The Synergy between Computer Science and CFD Modeling

Description:

This will not speed up small models due to the overhead of message passing ... The result: More robust, reliable modeling, at much higher speed, with less effort ... – PowerPoint PPT presentation

Number of Views:58
Avg rating:3.0/5.0
Slides: 38
Provided by: dovkr
Category:

less

Transcript and Presenter's Notes

Title: The Synergy between Computer Science and CFD Modeling


1
The Synergy between Computer Science and CFD
Modeling
Dov Kruger Anne Pence Alan Blumberg
2
Background of the Davidson Laboratory Team
  • Professor Alan Blumberg, co-author of POM and
    author of ECOMSED
  • Dov Kruger, computer science, transitioning into
    ocean research
  • Anne Pence, PhD candidate with a background in
    naval architecture and Oceanography

3
CFD is a Computing War Against Big Problems
  • Overview of 5 key areas, and focus on our
    research in single and multiple CPU performance
    and accuracy
  • Understanding the battleground Architectural
    Features of Computers
  • Having a tactical plan coding efficiency
  • Effective Use of resources
  • Parallelism on shared memory architectures
  • Auto-parallelizing compilers
  • OpenMP
  • Parallelism using distributed computers (MPI)
  • Hitting the correct target Accuracy

4
Ignorance is useful, knowledge is crucial
  • We questioned everything, including equation of
    state
  • Surprising results a new equation of state, and
    a new style of coding CFD models that coalesces
    loops and is significantly faster on modern
    computers
  • Nothing worked until Professor Blumberg
  • Conveyed the analytics of the core routines
  • Designed suitable test cases
  • Debugged problems

5
Doubling Performance of Barotropic Mode 10
months 2.5 hours
  • Blindly optimizing the code is very difficult
  • It took time to understand the model
  • Problems
  • Validating results
  • Finding the right test case
  • Incremental test procedure
  • Once the correct strategy was in place, it took
    only 2.5 hours to double the performance of
    barotropic mode
  • More optimizations are possible, but they get
    exponentially more difficult
  • We propose not to do them by hand

6
A Sample Problem
7
Computer Architecture
  • Locality Speed!
  • Registers
  • Different Levels of cache
  • Main Memory
  • Swapping
  • Disk
  • Sequential access is much faster than
    non-sequential
  • Locality is also accuracy
  • Multiple execution units
  • The limiting factor today is memory bandwidth
  • ECOMSED and POM are both memory limited

8
Efficiency
  • There is a cost structure to operations
  • Changes over time
  • Todays Highest Costs
  • Writing to memory
  • Cache must flush to memory
  • Reading from memory
  • Statistically, cached

exp log sqrt Division Multiplication
9
Heuristics for Writing Fast Code
  • Efficiency doesn't hurt
  • It may not help
  • Minimize cost of operations
  • Different on each machine
  • But similar enough that heuristics will produce
    good results on most architectures
  • On average, performance will improve if
    expressions are sufficiently large

10
Example Computing Bottom Stress original
  • do j2,jmm1
  • do i2,imm1
  • ubar(i,j,kbm1)0.5(ua(i,j)ua(i1,j))
  • vbar(i,j,kbm1)0.5(va(i,j)va(i,j1))
  • enddo
  • enddo
  • do j2,jmm1
  • do i2,imm1
  • qbar(i,j)sqrt(ubar(i,j,kbm1)ubar(i,j,kbm
    1)
  • vbar(i,j,kbm1)vbar(i,j,kbm
    1))
  • enddo
  • enddo
  • do j2,jmm1
  • do i2,imm1
  • if (fsm(i,j).gt.0.0) then
  • tau(i,j,kb)10000.cbc(i,j)/
  • varbf(i,j)qbar(i,j)qbar(i,j)
  • endif

11
Bottom Stress 21 Times Faster
  • do j2,jmm1
  • do i2,imm1
  • tempuBAR (UA(I,J)UA(I1,J))
  • tempVBAR (VA(I,J)VA(I,J1))
  • TAU(I,J,KB)CBC2(I,J)(tempUBAR2
    tempVBAR2)
  • enddo
  • enddo

12
The Current State of Compiler Optimization
  • Compilers typically do not rearrange floating
    point expressions
  • For fear of subtle roundoff effects
  • Instruction scheduling can still be done as long
    as the two data streams do not interact
  • Example (x y) - (x / y) (x y) (x y)
  • Common subexpressions are well-handled provided
    they are in the same form
  • Algebraic equivalences are not considered
  • Constant subexpressions are pre-evaluated at
    compile time in C and FORTRAN

13
Examples Writing Fast Code
  • Only scalars are considered constant
  • Writing to an array is not optimized
    awayvar1(i,j,k) c1 c2 c3var1(i,j,k)
    var1(i,j,k) c4var1(i,j,k) c1 c2 c3
    c4c1 var2(i,j,k) c3 var3(i,j,k)
    c4c1c3c4 var2(i,j,k) var3(i,j,k)var2(i,j,k
    ) var3(i,j,k) c1c3c4var2(i,j,k)
    var3(i,j,k) (c1c3c4)var1(i,j,k) / c1(i,j,k)

14
Skipping Land
  • Goals
  • Save CPU by only processing water boxes
  • Cost very little CPU for all-water problems
  • IF-tests slow down the CPU, so preferably avoid
    them
  • Stripwise unstructured view of the grid avoids
    additional tests
  • inefficient on a vector machine
  • k-order would be better to test an entire water
    column at once, but this would require
    re-engineering the entire model

15
Skipping land, continued
istart,jrow, iend
2,2,2 5,2,6 2,3,2 5,3,6 2,4,6 2,5,6 2,6,6
istart,jrow, iend
16
High Performance Advection
  • Scalar Fluxes in one preferred direction
  • No memory access for flux
  • Full machine precision

xfluxi
Xfluxi1
Var(i,j,k)xFluxiP1 xFluxixFluxi xFluxiP1
17
Advection, cont
  • yfluxJ(1) 0
  • yFluxJ(im) 0
  • diffusiveyfluxJ(1) 0
  • diffusiveyFluxJ(im) 0
  • do k2,kbm1
  • do i2,imm1
  • yfluxJ(i) 0 ! because of land
    boundary condition
  • diffusiveYFluxJ(i) 0
  • enddo
  • do j2,jmm1
  • xfluxI 0 ! because of land
    boundary condition
  • do i2,imm1
  • xfluxIP1 (f(i,j,k)f(i1,j,k))
    xmfl3d(i1,j,k)
  • yFluxJP1 (f(i,j,k)f(i,j1,k))
    ymfl3d(i,j1,k)
  • diffusiveXFluxIP1 -aamx(i,j,k)
    addDTi(i,j,BACK)
  • (fb(i,j,k)-fb(i-1,j,k))uMaskedH2_2H1(i,
    j)
  • diffusiveYFluxJP1 -aamy(i,j,k)
    addDTj(i,j, BACK)
  • (fb(i,j,k)-fb(i,j-1,k))vMaskedH1_2H2(i,
    j)

18
Advection, take 3
  • ff(i,j,k)
  • (
  • fB(i,j,k) volT(i,j,BACK) -
    !advective part
  • dti2 0.5 invDZ(k)
  • (
  • (f(i,j,k-1)f(i,j,k))w(i,j,k
    ) -
  • (f(i,j,k)f(i,j,k1))w(i,j,k
    1)ART(i,j)
  • xFluxIP1 - xFluxI yFluxJP1
    - yFluxJ(i)
  • ) ! diffusive part
  • diffusiveXfluxIP1 -
    diffusiveXFluxI
  • diffusiveYFluxJP1 -
    diffusiveYFluxJ(i)
  • )
  • invVolT(i,j,FUTURE)
  • xFluxI xFluxIp1
  • diffusiveXFluxI diffusiveXFluxIp1
  • yFluxJ(i) yFluxJP1
  • diffusiveYFluxJ(i)diffusiveYFluxJP1
  • enddo
  • enddo

19
K-order is Faster
  • POM and ECOMSED are currently stored with I
    varying most frequently
  • var(i,j,k)
  • For most algorithms, traversal order is
    irrelevant, except
  • Vertical models are naturally ordered top to
    bottom
  • A model stored and accessed in k-order can
    implement the vertical profile algorithms much
    more efficiently

20
Performance of PROFT/PROFS
  • On a sample problem
  • 3 times faster for profiling temperature
  • 8 times faster for profiling salinity
  • This includes the penalty of traversing in the
    wrong memory order for T,S
  • For a k-ordered model, code would be even faster
  • Techniques as discussed before, and also
  • Removal of exponentiation where unnecessary

21
Shared Memory Parallelism
  • Compiler automatically recognizes opportunities
    for multiple CPUs to split up loops
  • Can be
  • Automatic (under compiler control)
  • Manual (OpenMP)
  • Problem is memory bandwidth

22
Memory Bandwidth and Shared Memory Computers
  • Different levels
  • Commodity PC dual processors
  • Mid-range shared-memory computers
  • Larger cache
  • Floating point performance of CPUs are critical
  • Special-purpose machines with exotic memory and
    caches
  • Write-back cache synchronization

23
MPI
  • Message Passing Interface requires manual coding
    intervention
  • The programmer can split the domain across
    multiple computers
  • Messages are passed exchanging information
    between models
  • This will not speed up small models due to the
    overhead of message passing
  • The volume of the model must be large with
    respect to the data exchange interface

24
Accuracy
  • Values in registers retain extra bits of
    precision
  • This can substantially improve computational
    results
  • Every time values are stored to single precision
    arrays, this extra information is lost
  • Example Vectorization on the Pentium 4
  • Very fast (4 simultaneous 32-bit operations)
  • Values change substantially in the model
  • Loss of significant figures

25
Simulation is not Prediction
  • Roundoff error slowly burns away digits of
    precision
  • No escape from information entropy
  • Roundoff is indistinguishable from small
    differences in the initial conditions
  • Single precision is not enough
  • After 12 hours, salinity results can differ by
    3ppt on different computers

26
Calibration of Model Run on Different Machines?
  • Calibration represents balancing all the
    constituents of the equations as they are
    computed by the model
  • Some parts of the model are more sensitive to
    roundoff error than others
  • Some parts are more driven by forcing functions
    and are therefore more stable
  • Turbulence closure is particularly twitchy

27
Density Computation
  • The Equation of State has been defined by Millero
    and Poisson in UNESCO
  • Accuracy is high
  • On the order of 6 digits
  • Original data (gt500 samples) 5 digits
  • Computational cost is also high
  • Many terms including s1.5
  • How much accuracy is needed?

28
Comparison Fofonoff 1952 vs. UNESCO
  • ECOMSED used Fofonoff 1952 until recently
  • Approximately 0.6 difference at worst case at
    approximate 10C, 10ppt
  • In computer model, the absolute density is
    irrelevant
  • The density difference between adjacent cells
    drives the forcing
  • In a 20ppt difference
  • No significant difference in X1
  • Small differences in distribution of salt
    gradient between 0 and 20 ppt.

29
Difference between UNESCO and Fofonoff 1950
30
A Faster Algorithm densityDKAP
  • c1 6.38582008014815d-12
  • c2 -1.07670862741285d-09
  • c3 9.73156784811086d-08
  • c4 -9.0042779076007d-06
  • c5 6.60339902342078d-05
  • c6 -0.000132403263360079d0
  • c7 4.91156586724696d-12
  • c8 -7.81209323957205d-10
  • c9 6.72419650942208d-08
  • c10 -3.5945086374423d-06
  • c11 0.000809532551425d0
  • c12 -1.59139276805119d-07
  • c13 -2.00957653399204d-12
  • c14 1.28333259714417d-10
  • c15 2.34587013019438d-09
  • do k1,kb-1
  • do j2,jm-1
  • do i2,im-1
  • rho(i,j,k)
  • ((((c13 t(i,j,k) c14) t(i,j,k)
    c15)
  • s(i,j,k) c12) s(i,j,k)
  • ((((c7 t(i,j,k) c8) t(i,j,k)) c9)
  • t(i,j,k) c10) t(i,j,k) c11)
    s(i,j,k)
  • ((((c1 t(i,j,k) c2) t(i,j,k) c3)
  • t(i,j,k)c4) t(i,j,k) c5) t(i,j,k)
    c6
  • enddo
  • enddo
  • enddo

31
UNESCO, DKAP and MS
  • Dr. Martin Senator of Davidson Laboratory
    contributed a different fit, MSSuperior to
    UNESCO in fit to the original data

32
Consistent Attention to Accuracy
  • In POM and ECOMSED, density computation is
    performed
  • Rho is computed for each cell
  • In POM, density is stored as a ratio from rhoref
    to save digits
  • Differences are subtracted between adjacent cells
  • Salinity difference of 0.001 ppt yields a result
    in the 7th decimal place
  • Temperature difference of 0.01 C yields a
    result in the 7th decimal place
  • Result 0 digits accuracy
  • Conclusion Compute density in double precision.
    Anything else may be inaccurate

33
Double precision Density vs. Single Precision
  • Double precision is more stable
  • Differential Calculation of density will be
    presumably much more accurate, and therefore even
    more stable

34
Calculate Density Difference Analytically
  • For greatest accuracy, calculate differential
    density directly
  • Slightly slower, but full accuracy
  • Density(s,t) Pn(s,t)
  • DiffDensity(s,t,ds,dt)

35
Future Directions in Hardware
  • New processors with more registers
  • Bigger expressions become even more advantageous
  • Improved shared memory performance
  • Improved writeback cache design, larger caches,
    faster interconnects
  • Intelligent memory subsystems preloading values
  • Sequential access becomes even more important

36
Improved Software
  • We propose to write a Model Construction Toolkit
    (MCT) that will do all the optimizations
    mentioned here, and more
  • Reimplement POM/ECOMSED using the new toolkit
  • Allow calling FORTRAN routines
  • Enter algorithms either in a FORTRAN-like syntax
    or in a higher-level syntax based on common
    notation for difference equations
  • Automatically optimize expressions
  • Compute array subexpressions only when changed
  • Compute results using an optimal function given
    the situation
  • The result More robust, reliable modeling, at
    much higher speed, with less effort

37
Acknowledgements
  • Professor Roger Pinkham, for superb numerical and
    statistical knowledge and insight
  • Dr. Martin Senator for his densityMS fit and help
    getting started with R
  • The makers of R, an incredible statistics package
  • John Gilson, who reviewed the presentation and
    helped us get the right computing focus, and
    whose discussions first got us started thinking
    about a model construction toolkit
Write a Comment
User Comments (0)
About PowerShow.com