Optimizing Performance of the Lattice Boltzmann Method for Complex Structures

1
Optimizing Performance of the Lattice Boltzmann
Method for Complex Structures
  • Friedrich-Alexander University Erlangen/Nuremberg
  • Department of Computer Science 10 (System
    Simulation)
  • Regional Computing Center of Erlangen (RRZE)

2
Outline
  • Introduction
  • Lattice Boltzmann Method
  • Implementation Aspects
  • Application
  • Implementation
  • Optimization
  • Results
  • Conclusion

3
Lattice Boltzmann Method
  • Boltzmann Equation
  • Discretization of particle velocity space
  • (finite set of discrete velocities)

4
Lattice Boltzmann Method
  • Different discretization schemes
  • Numerical accuracy and stability
  • Computational speed and simplicity

D3Q15
D3Q19
D3Q27
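
The three lattice names can be made concrete with a short sketch. It assumes the usual D3Qn convention: velocities are drawn from the 27 integer vectors in {-1, 0, 1}^3, and each model keeps a subset selected by squared speed. The function name `velocity_set` is hypothetical and not taken from the slides.

```python
from itertools import product

def velocity_set(name):
    """Return the discrete velocities of a D3Qn lattice as integer triples."""
    # Squared speeds kept from the 27 vectors in {-1, 0, 1}^3.
    allowed = {
        "D3Q15": {0, 1, 3},     # rest, 6 axis, 8 cube-diagonal directions
        "D3Q19": {0, 1, 2},     # rest, 6 axis, 12 face-diagonal directions
        "D3Q27": {0, 1, 2, 3},  # all 27 directions
    }[name]
    return [c for c in product((-1, 0, 1), repeat=3)
            if sum(k * k for k in c) in allowed]

for name in ("D3Q15", "D3Q19", "D3Q27"):
    print(name, len(velocity_set(name)))
```

More velocities generally cost more memory traffic per cell update, which is the accuracy versus speed trade-off named on this slide.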
5
Lattice Boltzmann Method
  • Discretization in space x and time t

collision step
streaming step
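
For reference, a common way to write the two steps is the single-relaxation-time (BGK) form below. This is a sketch of the standard formulation, not necessarily the notation of the original slide; the symbols $\tilde f_i$ (post-collision value), $\omega$ (relaxation rate), $\mathbf{e}_i$ (discrete velocity) and $f_i^{\mathrm{eq}}$ (equilibrium distribution) are assumptions.

```latex
\begin{align}
\text{collision step:}\quad
  & \tilde f_i(\mathbf{x}, t) = f_i(\mathbf{x}, t)
    - \omega \left[ f_i(\mathbf{x}, t) - f_i^{\mathrm{eq}}(\rho, \mathbf{u}) \right] \\
\text{streaming step:}\quad
  & f_i(\mathbf{x} + \mathbf{e}_i\,\delta t,\; t + \delta t) = \tilde f_i(\mathbf{x}, t)
\end{align}
```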
6
Lattice Boltzmann Method (Implementation Aspects)
  • Discretization in space x and time t

collision step
streaming step
  • Stream-Collide (Pull-Method)
  • Get the distributions from the neighboring cells
    in the source array and store the relaxed
    values to one cell in the destination array
  • Collide-Stream (Push-Method)
  • Take the distributions from one cell in the
    source array and store the relaxed values to
    the neighboring cells in the destination array

[Figure: source and destination arrays]
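
The difference between the two update orders can be sketched on a tiny periodic 1D lattice. Everything here is an illustrative assumption: a D1Q3 model instead of the 3D models of the talk, a BGK collision, and the hypothetical names `collide`, `step_push` and `step_pull`.

```python
C = (0, 1, -1)            # D1Q3 velocities: rest, right, left
W = (2 / 3, 1 / 6, 1 / 6)  # standard D1Q3 lattice weights

def collide(cell, omega=1.2):
    """BGK relaxation of one cell (a list of 3 distributions)."""
    rho = sum(cell)
    u = (cell[1] - cell[2]) / rho
    feq = [w * rho * (1 + 3 * c * u + 4.5 * (c * u) ** 2 - 1.5 * u * u)
           for c, w in zip(C, W)]
    return [f - omega * (f - fe) for f, fe in zip(cell, feq)]

def step_push(src):
    """Collide-stream: read ONE source cell, write to NEIGHBORING cells."""
    n = len(src)
    dst = [[0.0] * 3 for _ in range(n)]
    for x in range(n):
        post = collide(src[x])
        for i, c in enumerate(C):
            dst[(x + c) % n][i] = post[i]
    return dst

def step_pull(src):
    """Stream-collide: read from NEIGHBORING source cells, write ONE cell."""
    n = len(src)
    dst = [None] * n
    for x in range(n):
        gathered = [src[(x - c) % n][i] for i, c in enumerate(C)]
        dst[x] = collide(gathered)
    return dst

f0 = [[0.5, 0.3, 0.2], [0.4, 0.1, 0.5], [0.6, 0.2, 0.2], [0.3, 0.3, 0.4]]
print(step_push(f0)[0])
print(step_pull(f0)[0])
```

The two variants perform the same collision and streaming operations and differ only in which side of the memory access is scattered: push has contiguous loads and scattered stores, pull the reverse.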
7
Lattice Boltzmann Method (Implementation Aspects)
  • Walls and obstacles: bounce-back rule

8
Implementation Aspects
  • Data Dependencies
  • Two Grids
  • Compressed Grid

9
Implementation Aspects
double precision f(0:xMax+1,0:yMax+1,0:zMax+1,0:18,0:1)
do z=1,zMax
  do y=1,yMax
    do x=1,xMax
      if( fluidcell(x,y,z) ) then
        ! Collide
        LOAD f(x,y,z, 0:18,t)
        Relaxation (complex computations)
        ! Stream
        SAVE f(x  ,y  ,z  , 0,t+1)
        SAVE f(x+1,y+1,z  , 1,t+1)
        SAVE f(x  ,y+1,z  , 2,t+1)
        SAVE f(x-1,y+1,z  , 3,t+1)
        ...
        SAVE f(x  ,y-1,z-1,18,t+1)
      endif
    enddo
  enddo
enddo
10
Application
11
Application: Porous Media Combustion
  • New technology for heating installations
  • Porous Media Combustion
  • The fuel-air mixture no longer reacts in a free
    flame
  • Combustion process takes place inside the pores
    of a porous medium that is placed in the reaction
    area

[Figure labels: 60 mm, 20 mm]
12
Application: Porous Media Combustion
Figures courtesy of LSTM Uni-Erlangen, Thomas Zeiser
13
Application: Porous Media Combustion
  • Various applications
  • Modern steam engines, vehicle heaters

Figures courtesy of LSTM Uni-Erlangen, Thomas Zeiser
14
Application: Introducing the Complex Geometries Used
  • Porous medium from PMC: silicon carbide (SiC)
  • Many small obstacles
  • Obstacle/fluid ratio: 2
  • High number of fluid-solid faces

15
Application: Introducing the Complex Geometries Used
  • Second test geometry
  • MC
  • Huge obstacles, only a few fluid tubes
  • Obstacle/fluid ratio: 50
  • Low number of fluid-solid faces

16
Implementation
17
Implementation
  • Collision and streaming step in same loop
  • Push-Method
  • Data representation in 1D-Array
  • Stores only fluid cells (→ saves memory)
  • Indirect addressing of target cells (by extra
    connectivity array)
  • Boundary conditions (Bounce Back) handled
    implicitly

18
Implementation
  • Indirect addressing and implicit Bounce Back

[Figure: cells i-1, i and i+1 next to an obstacle wall, with the corresponding connectivity array]
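
A minimal sketch of how a preprocessor might build such a connectivity array is shown below. It is an illustration under stated assumptions, not the authors' code: a 2D D2Q9 lattice instead of D3Q19, everything outside the grid treated as wall, and hypothetical names (`build_connectivity`, `stream_push`). Each entry holds the flat destination index for one (cell, direction) pair; if the neighbor is an obstacle, the entry points back to the same cell with the opposite direction, so bounce back needs no extra branch in the solver.

```python
# D2Q9 velocities and, for each direction, the index of its opposite.
Q9 = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
      (1, 1), (-1, 1), (-1, -1), (1, -1)]
OPP = [Q9.index((-cx, -cy)) for cx, cy in Q9]

def build_connectivity(solid):
    """solid[y][x] is True for obstacle cells. Returns (cells, conn)."""
    ny, nx = len(solid), len(solid[0])
    # Only fluid cells get a slot in the 1D array.
    cells = [(x, y) for y in range(ny) for x in range(nx) if not solid[y][x]]
    index = {xy: n for n, xy in enumerate(cells)}
    q = len(Q9)
    conn = [0] * (len(cells) * q)
    for n, (x, y) in enumerate(cells):
        for i, (cx, cy) in enumerate(Q9):
            target = index.get((x + cx, y + cy))  # None: obstacle or outside
            if target is None:
                conn[n * q + i] = n * q + OPP[i]   # implicit bounce back
            else:
                conn[n * q + i] = target * q + i   # ordinary streaming
    return cells, conn

def stream_push(src, conn):
    """Single loop over all stored values; no if-statement for walls."""
    dst = [0.0] * len(src)
    for k, value in enumerate(src):
        dst[conn[k]] = value
    return dst

solid = [[False, False, False, False],
         [False, True,  True,  False],
         [False, False, False, False]]
cells, conn = build_connectivity(solid)
print(len(cells), "fluid cells stored for a grid of", 3 * 4, "cells")
```

In this sketch every destination slot is written exactly once per step, so the streaming loop is a pure permutation of the stored values.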
20
Implementation
  • Preprocessor
  • Sets up connectivity array
  • Specifies all domain parameters
  • Geometry and obstacles
  • Traversing scheme
  • Solver
  • Reads in preprocessed information
  • Performs lattice Boltzmann method (in single loop)

21
Implementation
  • Rating
  • 1D-Array compared to standard implementation
    using multidimensional array and three loops
  • Advantages
  • Saves memory
  • The more obstacles in the domain, the higher
    the compression
  • Implicit Bounce Back
  • No extra routine or if-statement needed
  • Drawbacks
  • Indirect addressing
  • Prevents compiler from vectorization and other
    optimization techniques
  • → Consequently, worse performance

22
Optimization - Outline
  • Memory Traversing Schemes
  • Space-Filling Curves
  • Blocking
  • Memory Layouts
  • Further Optimization Techniques

23
Memory Traversing Schemes: Space-Filling Curves
  • What is a space-filling curve?
  • Loosely speaking
  • A one dimensional curve that fills a higher
    dimensional space
  • Which curves were used?
  • Hilbert
  • Peano
  • How are they constructed?
  • Again, loosely
  • By a mapping from a one-dimensional interval to
    the higher dimensional space
  • Then, recursively, a new mapping of each part of
    the interval
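
As an illustration of mapping a one-dimensional index onto a higher-dimensional grid, the sketch below uses the widely known iterative index-to-coordinate conversion for the 2D Hilbert curve. It is a swapped-in technique for illustration only: the talk works in 3D, and the function name `hilbert_d2xy` is hypothetical.

```python
def hilbert_d2xy(n, d):
    """Map index d (0 <= d < n*n) to (x, y) on an n x n grid, n a power of 2."""
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate/flip the sub-square so consecutive pieces connect.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

# Traversal order for a 4 x 4 grid: neighbors along the curve are
# also neighbors in space.
order = [hilbert_d2xy(4, d) for d in range(16)]
print(order)
```

A preprocessor could number the fluid cells in this order so that cells close along the curve end up close in memory.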

24
Memory Traversing Schemes: Space-Filling Curves
  • And how does the construction really work?

[Figure: construction on the interval from 0 to 1]
25
Memory Traversing Schemes: Space-Filling Curves
  • How does that look in 3D?
  • OK, but how to construct them?

26
Memory Traversing Schemes: Space-Filling Curves
  • How to construct them?
  • Table-based approach
  • Hilbert: 48 productions with 8 entries and 7
    connectors
  • Peano: 8 productions with 27 entries and 26
    connectors

Current level   Next level (8 entries alternating with 7 connectors)
bne             enb  b  ben  n  ben  f  fse  e  fse  b  bws  s  bws  f  wnf
fws             swf  f  fsw  w  fsw  b  bes  s  bes  f  fne  e  fne  b  nwb
27
Memory Traversing Schemes: Space-Filling Curves
  • Summary for Space-Filling Curves
  • Recursive production by segmentation
  • Limitation in system sizes
  • Hilbert: 2^(3n)
  • Peano: 3^(3n)
  • Increase spatial locality
  • Enable mesh-refinement

28
Memory Traversing Schemes: Blocking
  • Implicit blocking technique
  • Arrangement of data in a blocking manner
  • Increases spatial locality
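
Arranging data in a blocking manner can be sketched as a pure reordering of the cell list. The function name `blocked_order` and the cubic block edge `b` are assumptions for illustration; the talk does not give the block shape.

```python
def blocked_order(nx, ny, nz, b):
    """Return all cells of an nx*ny*nz grid, ordered block by block.

    Cells are sorted first by the block they fall into (edge length b),
    then lexicographically inside each block.
    """
    cells = [(x, y, z) for z in range(nz) for y in range(ny) for x in range(nx)]
    return sorted(cells, key=lambda c: (c[2] // b, c[1] // b, c[0] // b,
                                        c[2], c[1], c[0]))

order = blocked_order(4, 4, 4, 2)
print(order[:8])   # the eight cells of the first 2x2x2 block
```

Because only the numbering of the cells changes, a solver that walks a 1D array in index order needs no blocked loop nest of its own.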

29
Memory Traversing Schemes
  • Notes on Memory Traversing Schemes
  • Pure preprocessing technique
  • Only arrangement of data in memory changed
  • No change for solver
  • No overhead for solver (e.g. loop overhead for
    blocking)

30
Memory Layouts: Collision Optimized Layout
  • Standard array layout F(i,x,y,z,t):
    array of structures
  • Collision optimized
  • Optimal read access: 2 cache lines per LUP
  • Bad write access: 19 stores in 19 cache
    lines. But: depending on system size, some of them
    are already/still in cache

31
Memory Layouts: Collision Optimized Layout
  • 18 write accesses on one cell from 3 different
    z-layers
  • 8 write accesses on one cell from 3 different rows

32
Memory Layouts: Collision Optimized Layout
  • Performance of Collision-Optimized Layout (P4,
    512kB L2)

33
Memory Layouts: Propagation Optimized Layout
  • Optimized array layout F(x,y,z,i,t):
    structure of arrays
  • Stride-1 access on x (inner loop)
  • 19 cache lines per 16 LUPs in read and write
    process
  • 1 cache miss every 16th memory access
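
The cache-line counts for both layouts can be reproduced with simple index arithmetic. The sketch assumes Fortran storage order (first index fastest), 8-byte doubles and 128-byte cache lines, that is 16 doubles per line; the line size is an assumption chosen to match the "16" on the slide. The time index t is left out, and all function names are hypothetical.

```python
Q = 19                 # distributions per cell (D3Q19)
DOUBLES_PER_LINE = 16  # assumption: 128-byte cache lines, 8-byte doubles

def offset_collision_opt(i, x, y, z, nx, ny):
    """F(i,x,y,z): direction index i is the fastest-running index."""
    return i + Q * (x + nx * (y + ny * z))

def offset_propagation_opt(i, x, y, z, nx, ny, nz):
    """F(x,y,z,i): x is the fastest-running index, direction the slowest."""
    return x + nx * (y + ny * (z + nz * i))

def lines_touched(offsets):
    """Number of distinct cache lines covered by a set of array offsets."""
    return len({o // DOUBLES_PER_LINE for o in offsets})

nx = ny = nz = 32

# Collision optimized: the 19 values of one cell are contiguous.
one_cell = [offset_collision_opt(i, 0, 0, 0, nx, ny) for i in range(Q)]
print(lines_touched(one_cell))   # 2

# Propagation optimized: 16 consecutive cells along x, all 19 directions.
sixteen_cells = [offset_propagation_opt(i, x, 0, 0, nx, ny, nz)
                 for x in range(16) for i in range(Q)]
print(lines_touched(sixteen_cells))   # 19
```

For a cell whose 19 values are not aligned to a line boundary, the collision optimized read can span three lines instead of two; the example uses the first cell, which starts on a line boundary.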

34
Memory Layouts: Propagation Optimized Layout
  • Performance on Pentium 4, 512kB L2

35
Further Optimization Techniques
  • Additional Bottlenecks
  • Large loop body (causes register spills on IA32)
  • Concurrent writing to 19 different cache lines
    interferes with number of write combine buffers
    on IA32 (6 for Intel Xeon/Nocona)
  • Indirect addressing prevents IA32 hardware
    prefetcher from preloading values for target
    cells (due to bounce back at obstacles)
  • Solutions (implemented in Solver)
  • Split the loop into 5 loops of length Nx
  • Manual block preload technique
  • (Drawback: both techniques need a loop blocking
    scheme)
  • These solutions are only needed for IA32

36
Results - Outline
  • Architecture Descriptions
  • Comparison 1D-Solver to Standard Solver
  • Memory Layouts
  • For Standard Solver
  • For 1D-Solver
  • Influence of Geometry
  • MC
  • SiC
  • Memory Traversing Schemes
  • Space-Filling Curves
  • Blocking

37
Architecture Description: Nocona/Irwindale
  • Test system: test machine at RRZE
  • CPUs: Nocona (Irwindale), 3.6 GHz, 2 MB L2 cache
  • Memory: DDR400, 6.4 GB/s
  • Architectural specialties
  • EM64T extension
  • Hyperthreading
  • One memory bus for both processors

38
Architecture Description: Itanium 2 / Altix
  • Test system: RRZE SGI Altix
  • CPUs: Itanium 2, 1.3 GHz, 3 MB L3 cache
  • Memory: 112 GB distributed shared memory
  • Architectural specialties
  • Itanium 2
  • EPIC (Explicitly Parallel Instruction
    Computing)
  • No out-of-order
  • Parallelization of instructions is in the hands of the
    compiler (→ bundles)
  • L1 cache only for integer data
  • Altix
  • ccNUMA with NUMALink 3
  • Memory connected hierarchically by SHUBs

39
Architecture Description: AMD Opteron
  • Test system: LSS HPC cluster
  • CPUs: AMD Opteron, 2.2 GHz, 1 MB L2 cache,
    IA32-compatible
  • Memory: DDR333, 5.2 GB/s
  • Architectural specialties
  • Compute nodes with four CPUs
  • 4 GB RAM per CPU, each CPU can access 16 GB via
    ccNUMA
  • Interconnect
  • CPUs on one node: HyperTransport (6.4 GB/s)
  • Nodes: InfiniBand (10 GB/s)

40
Comparison 1D-Array Solver to Standard Solver
41
Memory Layouts for Standard Solver
42
Memory Layouts on Itanium 2
43
Memory Layouts on AMD Opteron
44
Memory Layouts on Nocona/Irwindale
45
Influence of Geometry: SiC-foam (2 obstacles)
46
Influence of Geometry: MC (50 obstacles)
47
Memory Traversing Schemes
48
Memory Traversing Schemes
49
Conclusion
  • The 1D-array data representation makes performance
    independent of the obstacle-to-fluid ratio
  • Memory traversing by space-filling curves results
    in performance similar to spatial blocking
  • Implementing SFCs is not worth the effort
    (if they are used only as a memory traversing
    alternative)
  • Together with indirect addressing, the Collision
    Optimized Layout with blocking is the best technique
    if the cache is larger than 1 MB
  • Indeed, there are cases where the Propagation
    Optimized Layout is not the best

50
Outlook
  • Future work could concern
  • Space-Filling Curves
  • A kind of staggered SFCs, with a separate curve for
    every direction
  • Avoid waste of underused cache lines where
    lattice sites are neighboring cells that are
    visited much later
  • Galerkin discretization or pointwise evaluation
    of LBM to enable a stack implementation in
    conjunction with SFCs
  • BUT: for real-world problems, construction on
    non-cubic grids is necessary first
  • Search for vectorization-enhancing techniques to
    overcome problems with indirect addressing on
    Itanium 2
  • Search for reasons why the Collision Optimized Layout
    is better than the Propagation Optimized Layout

51
Acknowledgement / References
  • Acknowledgement
  • Bavarian Graduate School for Computational
    Engineering
  • Thomas Zeiser (RRZE)
  • Gerhard Wellein (RRZE)
  • References
  • S. Donath, T. Zeiser, G. Hager, J. Habich, G.
    Wellein: Optimizing Performance of the Lattice
    Boltzmann Method for Complex Structures on
    Cache-based Architectures
  • G. Wellein, P. Lammers, G. Hager, S. Donath, T.
    Zeiser: Towards Optimal Performance for Lattice
    Boltzmann Applications on Terascale Computers
  • G. Wellein, T. Zeiser, S. Donath, G. Hager: On
    the single processor performance of simple
    lattice Boltzmann kernels