Title: Optimizing Performance of the Lattice Boltzmann Method for Complex Structures
Slide 1: Optimizing Performance of the Lattice Boltzmann Method for Complex Structures
- Friedrich-Alexander University Erlangen/Nuremberg
- Department of Computer Science 10 (System Simulation)
- Regional Computing Center of Erlangen (RRZE)
Slide 2: Outline
- Introduction
- Lattice Boltzmann Method
- Implementation Aspects
- Application
- Implementation
- Optimization
- Results
- Conclusion
Slide 3: Lattice Boltzmann Method
- Boltzmann equation
- Discretization of particle velocity space (finite set of discrete velocities)
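As a reference point for the two bullets above, and assuming the single-relaxation-time (BGK) collision model that lattice Boltzmann codes commonly use, the Boltzmann equation and its discrete-velocity form can be written as shown here. The symbols $\xi$ (particle velocity), $\lambda$ (relaxation time), $e_i$ (discrete velocities), $Q$ (number of discrete velocities) and $f^{\mathrm{eq}}$ (equilibrium distribution) are notation chosen for this sketch, not taken from the slides.

```latex
\begin{align}
  \frac{\partial f}{\partial t} + \xi \cdot \nabla f
    &= -\frac{1}{\lambda}\left(f - f^{\mathrm{eq}}\right), \\
  \frac{\partial f_i}{\partial t} + e_i \cdot \nabla f_i
    &= -\frac{1}{\lambda}\left(f_i - f_i^{\mathrm{eq}}\right),
    \qquad i = 0, \dots, Q-1 .
\end{align}
```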
Slide 4: Lattice Boltzmann Method
- Different discretization schemes
- Numerical accuracy and stability
- Computational speed and simplicity
[Figure: velocity sets of the D3Q15, D3Q19 and D3Q27 models]
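The three models named above differ only in which neighbours of a cell carry a distribution. A minimal sketch, assuming the usual definition of these sets by the squared length of the lattice vectors; the function name `velocity_set` is hypothetical, and the ordering of the vectors is not the direction numbering used in the deck.

```python
from itertools import product

def velocity_set(allowed_sq_norms):
    """Return all vectors in {-1,0,1}^3 whose squared length is allowed."""
    return [e for e in product((-1, 0, 1), repeat=3)
            if sum(c * c for c in e) in allowed_sq_norms]

# Squared length 0: rest vector, 1: six axis neighbours,
# 2: twelve edge neighbours, 3: eight corner neighbours.
D3Q15 = velocity_set({0, 1, 3})
D3Q19 = velocity_set({0, 1, 2})
D3Q27 = velocity_set({0, 1, 2, 3})

if __name__ == "__main__":
    for name, vs in (("D3Q15", D3Q15), ("D3Q19", D3Q19), ("D3Q27", D3Q27)):
        print(name, len(vs))
```

Fewer velocities mean fewer loads and stores per cell update, which is the speed side of the trade-off listed on the slide.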
Slide 5: Lattice Boltzmann Method
- Discretization in space x and time t
- Collision step
- Streaming step
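Assuming a single-relaxation-time (BGK) collision model with a dimensionless relaxation time $\tau$, lattice velocities $e_i$ and time step $\delta t$ (notation chosen here, not taken from the slides), the two steps are commonly written as:

```latex
\begin{align}
  \text{collision:} \quad
    \tilde f_i(x, t) &= f_i(x, t)
      - \frac{1}{\tau}\left[ f_i(x, t) - f_i^{\mathrm{eq}}(\rho, u) \right], \\
  \text{streaming:} \quad
    f_i(x + e_i\,\delta t,\; t + \delta t) &= \tilde f_i(x, t).
\end{align}
```

Here $\rho = \sum_i f_i$ and $\rho u = \sum_i e_i f_i$ are the local density and momentum from which the equilibrium is evaluated.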
Slide 6: Lattice Boltzmann Method (Implementation Aspects)
- Discretization in space x and time t
- Collision step
- Streaming step
- Stream-Collide (pull method): get the distributions from the neighboring cells in the source array and store the relaxed values to one cell in the destination array
- Collide-Stream (push method): take the distributions from one cell in the source array and store the relaxed values to the neighboring cells in the destination array
[Figure: source and destination arrays]
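The difference between the pull and push variants is only the order of gathering, relaxing and scattering. The sketch below illustrates this on a one-dimensional D1Q3 lattice with periodic boundaries; the lattice, the linear equilibrium, the value of `OMEGA` and all function names are assumptions made to keep the example short, not the deck's 3D kernel.

```python
E = (0, 1, -1)                  # D1Q3 velocities: rest, right, left
W = (2 / 3, 1 / 6, 1 / 6)       # lattice weights for D1Q3
OMEGA = 1.2                     # relaxation rate, arbitrary for this sketch

def relax(cell):
    """BGK relaxation of one cell towards a linear equilibrium."""
    rho = sum(cell)
    u = (cell[1] - cell[2]) / rho
    feq = [w * rho * (1 + 3 * e * u) for e, w in zip(E, W)]
    return [f - OMEGA * (f - q) for f, q in zip(cell, feq)]

def stream(src):
    """Pure streaming: every distribution moves one cell along its velocity."""
    n = len(src)
    return [[src[(x - e) % n][i] for i, e in enumerate(E)] for x in range(n)]

def pull_step(src):
    """Stream-collide: gather from neighbours, relax, write one cell."""
    n = len(src)
    dst = [None] * n
    for x in range(n):
        gathered = [src[(x - e) % n][i] for i, e in enumerate(E)]
        dst[x] = relax(gathered)
    return dst

def push_step(src):
    """Collide-stream: relax one cell, scatter to the neighbours."""
    n = len(src)
    dst = [[0.0] * len(E) for _ in range(n)]
    for x in range(n):
        post = relax(src[x])
        for i, e in enumerate(E):
            dst[(x + e) % n][i] = post[i]
    return dst

if __name__ == "__main__":
    f0 = [[W[i] * (1.0 + 0.1 * x) for i in range(3)] for x in range(6)]
    # Both variants apply the same operators, shifted by one streaming step.
    print(stream(pull_step(f0)) == push_step(stream(f0)))
```

Pull reads from several cells and writes to one; push reads from one cell and writes to several, which is the access pattern the later memory-layout slides refer to.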
Slide 7: Lattice Boltzmann Method (Implementation Aspects)
- Walls and obstacles: bounce-back rule
Slide 8: Implementation Aspects
- Data dependencies
- Two grids
- Compressed grid
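Two grids resolve the data dependencies by keeping a full source and a full destination copy. A compressed grid, as the term is commonly used in the LBM optimization literature, keeps a single array with a small spare margin, writes each result shifted by one cell, and alternates the sweep direction. The sketch below shows the idea in 1D with a three-point averaging stencil standing in for the LBM update; the stencil and all names are assumptions for illustration.

```python
def stencil(left, centre, right):
    """Stand-in for the update of one cell from its neighbourhood."""
    return (left + centre + right) / 3.0

def two_grid_sweeps(values, steps):
    """Reference: separate source and destination arrays, swapped per step."""
    src = list(values)
    n = len(src)
    for _ in range(steps):
        dst = list(src)                      # boundary cells keep their value
        for x in range(1, n - 1):
            dst[x] = stencil(src[x - 1], src[x], src[x + 1])
        src = dst
    return src

def compressed_grid_sweeps(values, steps):
    """One array with a single spare cell; results are written shifted by one."""
    n = len(values)
    buf = [0.0] + list(values)               # data at offset 1, buf[0] is spare
    offset = 1
    for _ in range(steps):
        if offset == 1:                      # ascending sweep, write shifted by -1
            order, shift = range(n), -1
        else:                                # descending sweep, write shifted by +1
            order, shift = range(n - 1, -1, -1), 1
        for x in order:
            p = x + offset                   # current position of logical cell x
            if 0 < x < n - 1:
                new = stencil(buf[p - 1], buf[p], buf[p + 1])
            else:
                new = buf[p]                 # boundary cell: copy unchanged
            buf[p + shift] = new             # overwrites a value no longer needed
        offset += shift
    return buf[offset:offset + n]

if __name__ == "__main__":
    data = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0]
    print(two_grid_sweeps(data, 5) == compressed_grid_sweeps(data, 5))
```

In this sketch the compressed variant needs n + 1 values instead of 2n, at the price of a fixed sweep order in each step.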
Slide 9: Implementation Aspects
double precision f(0:xMax+1, 0:yMax+1, 0:zMax+1, 0:18, 0:1)
do z = 1, zMax
  do y = 1, yMax
    do x = 1, xMax
      if ( fluidcell(x,y,z) ) then
        ! Collide
        LOAD f(x, y, z, 0:18, t)
        Relaxation (complex computations)
        ! Stream
        SAVE f(x  , y  , z  ,  0, t+1)
        SAVE f(x+1, y+1, z  ,  1, t+1)
        SAVE f(x  , y+1, z  ,  2, t+1)
        SAVE f(x-1, y+1, z  ,  3, t+1)
        ...
        SAVE f(x  , y-1, z-1, 18, t+1)
      endif
    enddo
  enddo
enddo
Slide 10: Application
Slide 11: Application - Porous Media Combustion
- New technology for heating installations
- Porous Media Combustion (PMC)
- The fuel-air mixture no longer reacts in a free flame
- The combustion process takes place inside the pores of a porous medium that is placed in the reaction area
[Figure: dimension labels 60 mm and 20 mm]
Slide 12: Application - Porous Media Combustion
Figures by courtesy of LSTM Uni-Erlangen, Thomas Zeiser
Slide 13: Application - Porous Media Combustion
- Various applications: modern steam engines, vehicle heaters
Figures by courtesy of LSTM Uni-Erlangen, Thomas Zeiser
Slide 14: Application - Introducing the Complex Geometries Used
- Porous medium from PMC: silicon carbide (SiC)
- Many small obstacles
- Obstacle/fluid-ratio 2
- High number of fluid-solid faces
Slide 15: Application - Introducing the Complex Geometries Used
- Second test geometry
- MC
- Huge obstacles, only a few fluid tubes
- Obstacle/fluid-ratio 50
- Low number of fluid-solid faces
Slide 16: Implementation
Slide 17: Implementation
- Collision and streaming step in the same loop
- Push method
- Data representation in a 1D array
- Stores only fluid cells (saves memory)
- Indirect addressing of target cells (by an extra connectivity array)
- Boundary conditions (bounce back) handled implicitly
Slide 18: Implementation
- Indirect addressing and implicit bounce back
[Figure: cells i-1, i and i+1 next to an obstacle wall, and the corresponding connectivity array]
Slide 20: Implementation
- Preprocessor
- Sets up connectivity array
- Specifies all domain parameters
- Geometry and obstacles
- Traversing scheme
- Solver
- Reads in preprocessed information
- Performs lattice Boltzmann method (in single loop)
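The division of labour between preprocessor and solver, together with the connectivity array and the implicit bounce back, can be sketched as follows. This is a minimal illustration under stated assumptions, not the deck's implementation: it uses a 2D D2Q9 lattice instead of the 3D lattice to stay short, treats everything outside the given geometry as a wall, and the names `preprocess`, `relax` and `solver_step` are hypothetical.

```python
from itertools import product

E = list(product((-1, 0, 1), repeat=2))           # D2Q9 velocities (assumption)
Q = len(E)
OPP = [E.index((-ex, -ey)) for ex, ey in E]       # index of the opposite direction
W = [(4 / 9, 1 / 9, 1 / 36)[ex * ex + ey * ey] for ex, ey in E]

def preprocess(fluid):
    """Number the fluid cells and build the connectivity array."""
    cells = [(x, y) for y, row in enumerate(fluid)
             for x, is_fluid in enumerate(row) if is_fluid]
    index = {cell: n for n, cell in enumerate(cells)}
    conn = [0] * (len(cells) * Q)
    for n, (x, y) in enumerate(cells):
        for i, (ex, ey) in enumerate(E):
            target = index.get((x + ex, y + ey))
            if target is None:
                # Neighbour is an obstacle (or outside): bounce back, i.e.
                # store into the opposite direction of the same cell.
                conn[n * Q + i] = n * Q + OPP[i]
            else:
                conn[n * Q + i] = target * Q + i
    return cells, conn

def relax(cell, omega=1.0):
    """BGK relaxation towards a linear equilibrium (simplification)."""
    rho = sum(cell)
    ux = sum(f * ex for f, (ex, _) in zip(cell, E)) / rho
    uy = sum(f * ey for f, (_, ey) in zip(cell, E)) / rho
    return [f - omega * (f - w * rho * (1 + 3 * (ex * ux + ey * uy)))
            for f, w, (ex, ey) in zip(cell, W, E)]

def solver_step(src, conn):
    """Single loop over fluid cells; walls need no if-statement here."""
    dst = [0.0] * len(src)
    for n in range(len(src) // Q):
        post = relax(src[n * Q:(n + 1) * Q])
        for i in range(Q):
            dst[conn[n * Q + i]] = post[i]
    return dst

if __name__ == "__main__":
    geometry = [[1, 1, 1], [1, 0, 1], [1, 1, 1]]  # one obstacle in the centre
    cells, conn = preprocess(geometry)
    f = [W[i] * (1 + 0.01 * n) for n in range(len(cells)) for i in range(Q)]
    print(len(cells), sum(f), sum(solver_step(f, conn)))
```

All geometry handling sits in `preprocess`; the solver only follows the indices it is given, so changing the traversal order of `cells` changes the memory arrangement without touching the solver loop.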
Slide 21: Implementation
- Rating: 1D array compared to the standard implementation using a multidimensional array and three loops
- Advantages
- Saves memory: the more obstacles in the domain, the higher the compression
- Implicit bounce back: no extra routine or if-statement needed
- Drawbacks
- Indirect addressing: prevents the compiler from vectorizing and from applying other optimization techniques
- Consequently, worse performance
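The memory argument can be made concrete with a rough estimate. The sketch assumes 8-byte reals, 19 distributions, two copies of the data and one halo layer for the standard solver, as in the array declaration of the pseudocode; the 4-byte index type and the 18 connectivity entries per fluid cell are assumptions, as are the function names.

```python
def full_grid_bytes(nx, ny, nz, q=19, grids=2, real=8):
    """Standard solver: two full grids with one halo layer on each side."""
    return (nx + 2) * (ny + 2) * (nz + 2) * q * grids * real

def fluid_list_bytes(n_fluid, q=19, grids=2, real=8, index=4):
    """1D solver: distributions of fluid cells only, plus connectivity array."""
    return n_fluid * q * grids * real + n_fluid * (q - 1) * index

def compression(nx, ny, nz, obstacle_fraction):
    """Ratio of 1D-solver memory to standard-solver memory."""
    n_fluid = round(nx * ny * nz * (1.0 - obstacle_fraction))
    return fluid_list_bytes(n_fluid) / full_grid_bytes(nx, ny, nz)

if __name__ == "__main__":
    for fraction in (0.02, 0.5, 0.9):
        print(fraction, round(compression(100, 100, 100, fraction), 2))
```

Under these assumptions the connectivity array costs extra memory per fluid cell, so the saving only appears once enough of the domain is covered by obstacles.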
Slide 22: Optimization - Outline
- Memory Traversing Schemes
- Space-Filling Curves
- Blocking
- Memory Layouts
- Further Optimization Techniques
Slide 23: Memory Traversing Schemes - Space-Filling Curves
- What is a space-filling curve?
- Loosely speaking: a one-dimensional curve that fills a higher-dimensional space
- Which curves were used?
- Hilbert
- Peano
- How are they constructed?
- Again, loosely: by a mapping from a one-dimensional interval to the higher-dimensional space
- Then, by recursion, a new mapping of each part of the interval
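The recursive mapping can be illustrated with the well-known bit-manipulation conversion for the 2D Hilbert curve. This is a swapped-in technique chosen for brevity: the deck works in 3D, and the function name `hilbert_d2xy` and its parameters are assumptions.

```python
def hilbert_d2xy(order, d):
    """Map position d along the curve to (x, y) on a 2^order x 2^order grid."""
    x = y = 0
    t = d
    s = 1
    while s < (1 << order):
        rx = 1 & (t // 2)           # which half in x at this recursion level
        ry = 1 & (t ^ rx)           # which half in y at this recursion level
        if ry == 0:                 # rotate or flip the sub-square built so far
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4                     # move on to the next, coarser level
        s *= 2
    return x, y

if __name__ == "__main__":
    curve = [hilbert_d2xy(2, d) for d in range(16)]
    print(curve)
```

Each pair of bits of `d` selects one of four sub-squares, which corresponds to one level of the recursive refinement of the interval; numbering cells in the order of `curve` keeps cells that are close on the curve close in space.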
Slide 24: Memory Traversing Schemes - Space-Filling Curves
- And how does the construction really work?
[Figure: construction of the curve, axis labels 0 and 1]
Slide 25: Memory Traversing Schemes - Space-Filling Curves
- How does that look in 3D?
- OK, but how to construct them?
Slide 26: Memory Traversing Schemes - Space-Filling Curves
- How to construct them?
- Table-based approach
- Hilbert: 48 productions with 8 entries and 7 connectors
- Peano: 8 productions with 27 entries and 26 connectors
Slide 27: Memory Traversing Schemes - Space-Filling Curves
- Summary for space-filling curves
- Recursive production by segmentation
- Limitation in system sizes
- Hilbert: $2^{3n}$
- Peano: $3^{3n}$
- Increase spatial locality
- Enable mesh refinement
Slide 28: Memory Traversing Schemes - Blocking
- Implicit blocking technique
- Arrangement of data in a blocking manner
- Increases spatial locality
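Arranging the data in a blocking manner amounts to choosing the order in which cells are numbered. A minimal sketch, assuming cubic blocks of edge length `b`; the function name and the sort key are illustrative choices, not the deck's preprocessor.

```python
def blocked_order(nx, ny, nz, b):
    """Return all cell coordinates, sorted block by block (block edge b)."""
    cells = [(x, y, z) for z in range(nz) for y in range(ny) for x in range(nx)]
    # Primary key: block coordinates; secondary key: position inside the block.
    return sorted(cells, key=lambda c: (c[2] // b, c[1] // b, c[0] // b,
                                        c[2], c[1], c[0]))

if __name__ == "__main__":
    order = blocked_order(4, 4, 4, 2)
    print(order[:8])    # the cells of the first 2x2x2 block
```

Because only the numbering changes, a solver that follows a connectivity array needs no blocked loop nest of its own.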
Slide 29: Memory Traversing Schemes
- Notes on memory traversing schemes
- Pure preprocessing technique
- Only the arrangement of data in memory is changed
- No change for the solver
- No overhead for the solver (e.g. loop overhead for blocking)
Slide 30: Memory Layouts - Collision Optimized Layout
- Standard array layout F(i,x,y,z,t): array of structures
- Collision optimized
- Optimal read access: 2 cache lines per lattice site update (LUP)
- Bad write access: 19 stores in 19 cache lines
- But: depending on the system size, some of them are already/still in cache
Slide 31: Memory Layouts - Collision Optimized Layout
- 18 write accesses on one cell from 3 different z-layers
- 8 write accesses on one cell from 3 different rows
Slide 32: Memory Layouts - Collision Optimized Layout
- Performance of Collision Optimized Layout (Pentium 4, 512kB L2)
Slide 33: Memory Layouts - Propagation Optimized Layout
- Optimized array layout F(x,y,z,i,t): structure of arrays
- Stride-1 access on x (inner loop)
- 19 cache lines per 16 LUPs in the read and write process
- 1 cache miss for every 16th memory access
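The cache-line counts quoted for the two layouts follow from the position of the direction index in the array. The sketch below reproduces the read-side numbers under explicit assumptions: 8-byte reals, 128-byte cache lines (which matches 16 values per line), an array that starts on a cache-line boundary, an x extent that is a multiple of 16, and the time index left out. The function names are hypothetical.

```python
Q = 19
LINE = 128 // 8          # values per cache line, assuming 128-byte lines

def aos_offset(i, x, y, z, nx, ny):
    """F(i,x,y,z): direction index i is fastest (column-major order)."""
    return i + Q * (x + nx * (y + ny * z))

def soa_offset(i, x, y, z, nx, ny, nz):
    """F(x,y,z,i): x is fastest, direction index i is slowest."""
    return x + nx * (y + ny * (z + nz * i))

def lines_touched(offsets):
    """Number of distinct cache lines covered by a set of element offsets."""
    return len({off // LINE for off in offsets})

if __name__ == "__main__":
    n = 32
    one_cell = [aos_offset(i, 0, 0, 0, n, n) for i in range(Q)]
    sixteen_cells = [soa_offset(i, x, 0, 0, n, n, n)
                     for i in range(Q) for x in range(16)]
    print(lines_touched(one_cell), lines_touched(sixteen_cells))
```

In the array-of-structures case the 19 values of one cell are contiguous, so reading them touches very few lines; in the structure-of-arrays case each direction occupies its own line, but that line serves 16 consecutive cells along x.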
Slide 34: Memory Layouts - Propagation Optimized Layout
- Performance on Pentium 4, 512kB L2
Slide 35: Further Optimization Techniques
- Additional bottlenecks
- Large loop body (causes register spills on IA32)
- Concurrent writing to 19 different cache lines interferes with the number of write-combine buffers on IA32 (6 for Intel Xeon/Nocona)
- Indirect addressing prevents the IA32 hardware prefetcher from preloading values for target cells (due to bounce back at obstacles)
- Solutions (implemented in the solver)
- Split up the loop into 5 loops of length Nx
- Manual block preload technique
- Drawback: both techniques need a loop blocking scheme
- These solutions are only needed for IA32
Slide 36: Results - Outline
- Architecture Descriptions
- Comparison 1D-Solver to Standard Solver
- Memory Layouts
- For Standard Solver
- For 1D-Solver
- Influence of Geometry
- MC
- SiC
- Memory Traversing Schemes
- Space-Filling Curves
- Blocking
Slide 37: Architecture Description - Nocona/Irwindale
- Test system: test machine at RRZE
- CPUs: Nocona (Irwindale), 3.6GHz, 2MB L2 cache
- Memory: DDR400, 6.4GB/s
- Architectural specialties
- EM64T extension
- Hyperthreading
- One memory bus for both processors
Slide 38: Architecture Description - Itanium 2 / Altix
- Test system: RRZE SGI Altix
- CPUs: Itanium 2, 1.3 GHz, 3MB L3 cache
- Memory: 112GB distributed shared memory
- Architectural specialties
- Itanium 2
- EPIC (Explicitly Parallel Instruction Computing)
- No out-of-order execution
- Parallelization of instructions is left to the compiler (bundles)
- L1 cache only for integer data
- Altix
- ccNUMA with NUMALink 3
- Memory connected hierarchically by SHUBs
Slide 39: Architecture Description - AMD Opteron
- Test system: LSS HPC cluster
- CPUs: AMD Opteron, 2.2GHz, 1MB L2 cache, IA32-compatible
- Memory: DDR333, 5.2GB/s
- Architectural specialties
- Compute nodes with four CPUs
- 4GB RAM per CPU; each CPU can access 16GB via ccNUMA
- Interconnect
- CPUs on one node: HyperTransport (6.4GB/s)
- Nodes: Infiniband (10GB/s)
Slide 40: Comparison of 1D-Array Solver to Standard Solver
Slide 41: Memory Layouts for Standard Solver
Slide 42: Memory Layouts on Itanium 2
Slide 43: Memory Layouts on AMD Opteron
Slide 44: Memory Layouts on Nocona/Irwindale
Slide 45: Influence of Geometry - SiC-foam (2 obstacles)
Slide 46: Influence of Geometry - MC (50 obstacles)
Slide 47: Memory Traversing Schemes
Slide 48: Memory Traversing Schemes
Slide 49: Conclusion
- The 1D-array data representation makes performance independent of the obstacle-to-fluid ratio
- Memory traversing by space-filling curves results in performance similar to spatial blocking
- Implementing SFCs is not worth the effort (if they are used only as a memory traversing alternative)
- Together with indirect addressing, the Collision Optimized Layout with blocking is the best technique if the cache is larger than 1 MB
- Indeed, there are cases where the Propagation Optimized Layout is not best
Slide 50: Outlook
- Future work could concern
- Space-filling curves
- A kind of staggered SFCs, with a separate curve for every direction
- Avoid wasting underused cache lines where lattice sites are neighboring cells that are visited much later
- Galerkin discretization or pointwise evaluation of LBM to enable a stack implementation in conjunction with SFCs
- BUT: for real-world problems, construction on non-cubic grids is necessary first
- Search for vectorization-enhancing techniques to overcome problems with indirect addressing on Itanium 2
- Search for reasons why the Collision Optimized Layout is better than the Propagation Optimized Layout
Slide 51: Acknowledgement / References
- Acknowledgement
- Bavarian Graduate School for Computational Engineering
- Thomas Zeiser (RRZE)
- Gerhard Wellein (RRZE)
- References
- S. Donath, T. Zeiser, G. Hager, J. Habich, G. Wellein: Optimizing Performance of the Lattice Boltzmann Method for Complex Structures on Cache-based Architectures
- G. Wellein, P. Lammers, G. Hager, S. Donath, T. Zeiser: Towards Optimal Performance for Lattice Boltzmann Applications on Terascale Computers
- G. Wellein, T. Zeiser, S. Donath, G. Hager: On the single processor performance of simple lattice Boltzmann kernels