Optimizing Performance of the Lattice Boltzmann Method for Complex Structures

1
Optimizing Performance of the Lattice Boltzmann
Method for Complex Structures
  • Friedrich-Alexander University Erlangen/Nuremberg
  • Department of Computer Science 10 (System
    Simulation)
  • Regional Computing Center of Erlangen (RRZE)

2
Outline
  • Introduction
  • Lattice Boltzmann Method
  • Implementation Aspects
  • Application
  • Implementation
  • Optimization
  • Results
  • Conclusion

3
Lattice Boltzmann Method
  • Boltzmann Equation
  • Discretization of particle velocity space
  • (finite set of discrete velocities)
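
For reference, a common way to write these two steps, assuming the single-relaxation-time (BGK) collision operator. The symbols $\xi$ (particle velocity), $\lambda$ (relaxation time), $f^{\mathrm{eq}}$ (equilibrium distribution) and $e_i$ (discrete velocities) are notation chosen here, not taken from the slide:

```latex
% Boltzmann equation with BGK collision term
\frac{\partial f}{\partial t} + \xi \cdot \nabla f
  = -\frac{1}{\lambda}\left(f - f^{\mathrm{eq}}\right)

% Restriction to a finite set of velocities e_i, i = 0, \dots, N
\frac{\partial f_i}{\partial t} + e_i \cdot \nabla f_i
  = -\frac{1}{\lambda}\left(f_i - f_i^{\mathrm{eq}}\right)
```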

4
Lattice Boltzmann Method
  • Different discretization schemes
  • Numerical accuracy and stability
  • Computational speed and simplicity

(Figure: D3Q15, D3Q19 and D3Q27 models)
5
Lattice Boltzmann Method
  • Discretization in space x and time t

collision step
streaming step
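
A sketch of the two steps in the usual lattice BGK form, assuming a dimensionless relaxation time $\tau$, a time step $\delta t$, and an intermediate post-collision value $\tilde{f}_i$ (all notation assumed here, not taken from the slide):

```latex
% Collision step: relax towards the local equilibrium
\tilde{f}_i(x, t) = f_i(x, t)
  - \frac{1}{\tau}\left[f_i(x, t) - f_i^{\mathrm{eq}}(\rho, u)\right]

% Streaming step: move the relaxed value to the neighbor in direction e_i
f_i(x + e_i\,\delta t,\; t + \delta t) = \tilde{f}_i(x, t)
```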
6
Lattice Boltzmann Method (Implementation Aspects)
  • Discretization in space x and time t

collision step
streaming step
  • Stream-Collide (Pull-Method)
      • Get the distributions from the neighboring cells
        in the source array and store the relaxed
        values to one cell in the destination array
  • Collide-Stream (Push-Method)
      • Take the distributions from one cell in the
        source array and store the relaxed values to
        the neighboring cells in the destination array

(Figure: source and destination arrays)
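
The two update orders can be sketched on a tiny periodic 1D lattice. This is a minimal illustration, assuming a D1Q3 velocity set, a BGK relaxation and a hypothetical relaxation parameter `OMEGA`; the slides use 3D models, and the function names are made up for this example.

```python
# D1Q3 lattice: rest, right-moving, left-moving (illustrative choice)
C = [0, 1, -1]
W = [2 / 3, 1 / 6, 1 / 6]
OMEGA = 1.25  # hypothetical relaxation parameter (1 / tau)


def relax(f):
    """BGK relaxation of the three distributions of one cell."""
    rho = sum(f)
    u = sum(c * fi for c, fi in zip(C, f)) / rho
    feq = [w * rho * (1 + 3 * c * u + 4.5 * (c * u) ** 2 - 1.5 * u * u)
           for c, w in zip(C, W)]
    return [fi - OMEGA * (fi - fe) for fi, fe in zip(f, feq)]


def pull_step(src):
    """Stream-collide: gather from neighbors, relax, write one cell."""
    n = len(src)
    dst = [None] * n
    for x in range(n):
        gathered = [src[(x - c) % n][i] for i, c in enumerate(C)]
        dst[x] = relax(gathered)
    return dst


def push_step(src):
    """Collide-stream: read one cell, relax, scatter to neighbors."""
    n = len(src)
    dst = [[0.0] * len(C) for _ in range(n)]
    for x in range(n):
        post = relax(src[x])
        for i, c in enumerate(C):
            dst[(x + c) % n][i] = post[i]
    return dst
```

Both variants conserve mass; they differ only in whether propagation happens before or after relaxation within one sweep, and therefore in whether the scattered accesses are reads (pull) or writes (push).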
7
Lattice Boltzmann Method (Implementation Aspects)
  • Walls and obstacles: bounce-back rule

8
Implementation Aspects
  • Data Dependencies
  • Two Grids
  • Compressed Grid

9
Implementation Aspects
double precision f(0:xMax+1,0:yMax+1,0:zMax+1,0:18,0:1)
do z=1,zMax
  do y=1,yMax
    do x=1,xMax
      if( fluidcell(x,y,z) ) then
        ! Collide
        LOAD f(x,y,z, 0:18,t)
        Relaxation (complex computations)
        ! Stream
        SAVE f(x  ,y  ,z  , 0,t+1)
        SAVE f(x+1,y+1,z  , 1,t+1)
        SAVE f(x  ,y+1,z  , 2,t+1)
        SAVE f(x-1,y+1,z  , 3,t+1)
        ...
        SAVE f(x  ,y-1,z-1,18,t+1)
      endif
    enddo
  enddo
enddo
10
Application
11
Application: Porous Media Combustion (PMC)
  • New technology for heating installations
  • Porous Media Combustion
  • Fuel-air mixture no longer reacts in a free
    flame
  • Combustion process takes place inside the pores
    of a porous medium that is placed in the reaction
    area

(Figure: dimensions labeled 60 mm and 20 mm)
12
Application: Porous Media Combustion (PMC)
Figures courtesy of LSTM Uni-Erlangen, Thomas Zeiser
13
Application: Porous Media Combustion (PMC)
  • Various applications
  • Modern steam engines, vehicle heaters

Figures courtesy of LSTM Uni-Erlangen, Thomas Zeiser
14
Application: Introducing the Complex Geometries Used
  • Porous medium from PMC: silicon carbide (SiC)
  • Many small obstacles
  • Obstacle/fluid ratio: 2
  • High number of fluid-solid faces

15
Application: Introducing the Complex Geometries Used
  • Second test geometry
  • MC
  • Huge obstacles, only a few fluid tubes
  • Obstacle/fluid ratio: 50
  • Low number of fluid-solid faces

16
Implementation
17
Implementation
  • Collision and streaming step in same loop
  • Push-Method
  • Data representation in 1D-Array
  • Stores only fluid cells (→ saves memory)
  • Indirect addressing of target cells (by extra
    connectivity array)
  • Boundary conditions (Bounce Back) handled
    implicitly

18
Implementation
  • Indirect addressing and implicit Bounce Back

(Figure: cells i-1, i and i+1 next to an obstacle wall, with the connectivity array)
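
Indirect addressing with implicit bounce back can be sketched as follows. This is a minimal 1D illustration, assuming a D1Q3 velocity set and a periodic domain; the names `build_connectivity` and `stream` are hypothetical, and the relaxation that the solver performs in the same loop is left out to keep the addressing visible.

```python
C = [0, 1, -1]     # D1Q3 velocities (illustrative; the slides use 3D models)
OPP = [0, 2, 1]    # index of the opposite direction for each velocity
Q = len(C)


def build_connectivity(is_fluid):
    """Preprocessing: number the fluid cells and record, for each
    (fluid cell, direction), the flat target index in the destination array."""
    n = len(is_fluid)
    cell_id = {}
    for x in range(n):
        if is_fluid[x]:
            cell_id[x] = len(cell_id)   # obstacle cells get no storage
    conn = []
    for x, k in cell_id.items():
        for i, c in enumerate(C):
            nb = (x + c) % n                       # periodic neighbor
            if is_fluid[nb]:
                conn.append(cell_id[nb] * Q + i)   # normal propagation
            else:
                conn.append(k * Q + OPP[i])        # implicit bounce back
    return conn


def stream(f, conn):
    """Solver-side propagation: one loop, no obstacle test."""
    dst = [0.0] * len(f)
    for src_idx, dst_idx in enumerate(conn):
        dst[dst_idx] = f[src_idx]
    return dst
```

For a domain `[fluid, fluid, obstacle, fluid]`, only 3 x 3 values are stored instead of 4 x 3, and the entries that would point into the obstacle point back to the sending cell with the direction reversed.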
20
Implementation
  • Preprocessor
  • Sets up connectivity array
  • Specifies all domain parameters
  • Geometry and obstacles
  • Traversing scheme
  • Solver
  • Reads in preprocessed information
  • Performs lattice Boltzmann method (in single loop)

21
Implementation
  • Rating
      • 1D-Array compared to the standard implementation
        using a multidimensional array and three loops
  • Advantages
      • Saves memory
          • The more obstacles in the domain, the higher
            the compression
      • Implicit Bounce Back
          • No extra routine or if-statement needed
  • Drawbacks
      • Indirect addressing
          • Prevents the compiler from vectorizing and
            applying other optimization techniques
          • → Consequently, worse performance

22
Optimization - Outline
  • Memory Traversing Schemes
  • Space-Filling Curves
  • Blocking
  • Memory Layouts
  • Further Optimization Techniques

23
Memory Traversing Schemes: Space-Filling Curves
  • What is a space-filling curve?
      • Loosely speaking: a one-dimensional curve that
        fills a higher-dimensional space
  • Which curves were used?
      • Hilbert
      • Peano
  • How are they constructed?
      • Again, loosely: by a mapping from a
        one-dimensional interval to the
        higher-dimensional space
      • Then, by recursion, a new mapping of each part
        of the interval
24
Memory Traversing Schemes: Space-Filling Curves
  • And how does the construction really work?

(Figure: construction steps, with labels 0 and 1)
25
Memory Traversing Schemes: Space-Filling Curves
  • How does that look in 3D?
  • OK, but how to construct them?

26
Memory Traversing Schemes: Space-Filling Curves
  • How to construct them?
  • Table-based approach
  • Hilbert: 48 productions with 8 entries and 7
    connectors
  • Peano: 8 productions with 27 entries and 26
    connectors

27
Memory Traversing Schemes: Space-Filling Curves
  • Summary for Space-Filling Curves
  • Recursive production by segmentation
  • Limitation in system sizes
  • Hilbert: 2^(3n)
  • Peano: 3^(3n)
  • Increase spatial locality
  • Enable mesh refinement

28
Memory Traversing Schemes: Blocking
  • Implicit blocking technique
  • Arrangement of data in a blocking manner
  • Increases spatial locality
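
A sketch of what arranging data "in a blocking manner" could look like in the preprocessor, assuming cubic blocks of edge length `b` (a hypothetical parameter) and x as the fastest index inside a block:

```python
def blocked_order(nx, ny, nz, b):
    """Enumerate all cells block by block; the position of a cell in the
    returned list would be its position in the 1D array."""
    order = []
    for z0 in range(0, nz, b):
        for y0 in range(0, ny, b):
            for x0 in range(0, nx, b):
                # all cells of one block are stored next to each other
                for z in range(z0, min(z0 + b, nz)):
                    for y in range(y0, min(y0 + b, ny)):
                        for x in range(x0, min(x0 + b, nx)):
                            order.append((x, y, z))
    return order
```

Because the ordering is fixed before the solver runs, the solver itself still executes a single flat loop over the array.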

29
Memory Traversing Schemes
  • Notes on Memory Traversing Schemes
  • Pure preprocessing technique
  • Only arrangement of data in memory changed
  • No change for solver
  • No overhead for solver (e.g. loop overhead for
    blocking)

30
Memory Layouts: Collision Optimized Layout
  • Standard array layout F(i,x,y,z,t):
    Array-of-Structures
  • Collision optimized
  • Optimal read access: 2 cache lines per lattice
    site update (LUP)
  • Bad write access: 19 stores in 19 cache lines.
    But: depending on the system size, some of them
    are already/still in cache

31
Memory Layouts: Collision Optimized Layout
  • 18 write accesses on one cell from 3 different
    z-layers
  • 8 write accesses on one cell from 3 different rows

32
Memory Layouts: Collision Optimized Layout
  • Performance of Collision-Optimized Layout (P4,
    512kB L2)

33
Memory Layouts: Propagation Optimized Layout
  • Optimized array layout F(x,y,z,i,t):
    Structure-of-Arrays
  • Stride-1 access on x (inner loop)
  • 19 cache lines per 16 LUPs in read and write
    process
  • 1 cache miss every 16th memory access
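
The cache-line counts for the two layouts can be reproduced with a small index calculation. This sketch assumes D3Q19, double precision and 128-byte cache lines (16 doubles per line, inferred from the "19 cache lines per 16 LUPs" figure), Fortran-style ordering with the leftmost index fastest, and no time index. It only counts the lines holding the 19 values of the given cells, not the neighbor writes of the push scheme. All names are hypothetical.

```python
Q = 19      # distributions per cell (D3Q19)
LINE = 16   # doubles per cache line (assumed 128-byte lines)


def idx_collision(i, x, y, z, nx, ny, nz):
    """F(i,x,y,z): the 19 values of one cell are contiguous."""
    return i + Q * (x + nx * (y + ny * z))


def idx_propagation(i, x, y, z, nx, ny, nz):
    """F(x,y,z,i): values of one direction are contiguous along x."""
    return x + nx * (y + ny * (z + nz * i))


def lines_touched(index_fn, cells, nx, ny, nz):
    """Number of distinct cache lines holding all 19 values of the cells."""
    lines = set()
    for (x, y, z) in cells:
        for i in range(Q):
            lines.add(index_fn(i, x, y, z, nx, ny, nz) // LINE)
    return len(lines)
```

On a 32 x 32 x 32 grid, the first cell occupies 2 lines in the collision-optimized layout and 19 lines in the propagation-optimized one, but in the latter those same 19 lines also serve the next 15 cells along x.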

34
Memory Layouts: Propagation Optimized Layout
  • Performance on Pentium 4, 512kB L2

35
Further Optimization Techniques
  • Additional bottlenecks
      • Large loop body (causes register spills on IA32)
      • Concurrent writing to 19 different cache lines
        interferes with the limited number of
        write-combine buffers on IA32 (6 for Intel
        Xeon/Nocona)
      • Indirect addressing prevents the IA32 hardware
        prefetcher from preloading values for target
        cells (due to bounce back at obstacles)
  • Solutions (implemented in the solver)
      • Split the loop into 5 loops of length Nx
      • Manual block preload technique
      • (Drawback: both techniques need a loop blocking
        scheme)
  • These solutions are only needed on IA32

36
Results - Outline
  • Architecture Descriptions
  • Comparison 1D-Solver to Standard Solver
  • Memory Layouts
      • For Standard Solver
      • For 1D-Solver
  • Influence of Geometry
      • MC
      • SiC
  • Memory Traversing Schemes
      • Space-Filling Curves
      • Blocking

37
Architecture Description: Nocona/Irwindale
  • Test system: test machine at RRZE
  • CPUs: Nocona (Irwindale), 3.6 GHz, 2 MB L2 cache
  • Memory: DDR400, 6.4 GB/s
  • Architectural specialties
      • EM64T extension
      • Hyperthreading
      • One memory bus for both processors

38
Architecture Description: Itanium 2 / Altix
  • Test system: RRZE SGI Altix
  • CPUs: Itanium 2, 1.3 GHz, 3 MB L3 cache
  • Memory: 112 GB distributed shared memory
  • Architectural specialties
      • Itanium 2
          • EPIC (Explicitly Parallel Instruction
            Computing)
          • No out-of-order execution
          • Parallelization of instructions is in the
            hands of the compiler (→ bundles)
          • L1 cache only for integer data
      • Altix
          • ccNUMA with NUMAlink 3
          • Memory connected hierarchically by SHUBs

39
Architecture Description: AMD Opteron
  • Test system: LSS HPC cluster
  • CPUs: AMD Opteron, 2.2 GHz, 1 MB L2 cache,
    IA32-compatible
  • Memory: DDR333, 5.2 GB/s
  • Architectural specialties
      • Compute nodes with four CPUs
      • 4 GB RAM per CPU; each CPU can access 16 GB via
        ccNUMA
  • Interconnect
      • CPUs on one node: HyperTransport (6.4 GB/s)
      • Nodes: InfiniBand (10 GB/s)

40
Comparison 1D-Array Solver to Standard Solver
41
Memory Layouts for Standard Solver
42
Memory Layouts on Itanium 2
43
Memory Layouts on AMD Opteron
44
Memory Layouts on Nocona/Irwindale
45
Influence of Geometry: SiC-foam (2 obstacles)
46
Influence of Geometry: MC (50 obstacles)
47
Memory Traversing Schemes
48
Memory Traversing Schemes
49
Conclusion
  • The 1D-array data representation makes performance
    independent of the obstacle-to-fluid ratio
  • Memory traversing by space-filling curves results
    in performance similar to spatial blocking
  • Implementing SFCs is not worth the effort
    (if they are used only as a memory traversing
    alternative)
  • Combined with indirect addressing, the Collision
    Optimized Layout with blocking is the best
    technique if the cache is larger than 1 MB
  • Indeed, there are cases where the Propagation
    Optimized Layout is not best

50
Outlook
  • Future work could concern
  • Space-Filling Curves
      • A kind of staggered SFC: a separate curve for
        every direction
      • Avoid wasting underused cache lines where
        lattice sites are neighboring cells which are
        visited much later
      • Galerkin discretization or pointwise evaluation
        of LBM to enable a stack implementation in
        conjunction with SFCs
      • But: for real-world problems, construction on
        non-cubic grids is necessary first
  • Search for vectorization-enhancing techniques to
    overcome problems with indirect addressing on
    Itanium 2
  • Search for reasons why the Collision Optimized
    Layout is better than the Propagation Optimized
    Layout

51
Acknowledgement / References
  • Acknowledgement
  • Bavarian Graduate School for Computational
    Engineering
  • Thomas Zeiser (RRZE)
  • Gerhard Wellein (RRZE)
  • References
  • S. Donath, T. Zeiser, G. Hager, J. Habich, G.
    Wellein: Optimizing Performance of the Lattice
    Boltzmann Method for Complex Structures on
    Cache-based Architectures
  • G. Wellein, P. Lammers, G. Hager, S. Donath, T.
    Zeiser: Towards Optimal Performance for Lattice
    Boltzmann Applications on Terascale Computers
  • G. Wellein, T. Zeiser, S. Donath, G. Hager: On
    the single processor performance of simple
    lattice Boltzmann kernels