Title: Scientific Computations on Modern Parallel Vector Systems
1 Scientific Computations on Modern Parallel Vector Systems
- Leonid Oliker, Jonathan Carter, Andrew Canning, John Shalf - Lawrence Berkeley National Laboratory
- Stephane Ethier - Princeton Plasma Physics Laboratory
- http://crd.lbl.gov/oliker
2 Overview
- Superscalar cache-based architectures dominate the HPC market
- Leading architectures are commodity-based SMPs, due to generality and the perception of cost effectiveness
- Growing gap between peak and sustained performance is well known in scientific computing
- Modern parallel vector systems may bridge this gap for many important applications
- In April 2002, the Earth Simulator (ES) became operational: peak ES performance > all DOE and DoD systems combined; demonstrated high sustained performance on demanding scientific apps
- Conducting an evaluation study of scientific applications on modern vector systems
- 09/2003 MOU between ES and NERSC was completed; first visit to the ES center December 8th-17th, 2003 (ES remote access not available); first international team to conduct a performance evaluation study at ES
- Examining the best mapping between demanding applications and leading HPC systems - one size does not fit all
3 Vector Paradigm
- High memory bandwidth
  - Allows systems to effectively feed ALUs (high byte-to-flop ratio)
- Flexible memory addressing modes
  - Support fine-grained strided and irregular data access
- Vector registers
  - Hide memory latency via deep pipelining of memory loads/stores
- Vector ISA
  - Single instruction specifies a large number of identical operations (see the loop sketch below)
- Vector architectures allow for
  - Reduced control complexity
  - Efficient utilization of a large number of computational resources
  - Potential for automatic discovery of parallelism
- However, most effective only if sufficient regularity is discoverable in the program structure
  - Suffers even if a small fraction of the code is non-vectorizable (Amdahl's Law)
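As a minimal illustration (a hypothetical loop, not taken from any of the studied codes), an element-wise update with independent iterations maps directly onto vector instructions, while a loop-carried dependence forces scalar execution:

      ! Vectorizable: iterations are independent, so the compiler can issue
      ! one vector instruction per VLEN-sized chunk of the loop.
      do i = 1, n
         y(i) = a*x(i) + y(i)
      end do

      ! Not vectorizable as written: iteration i reads the result of i-1,
      ! a loop-carried dependence that serializes the loop.
      do i = 2, n
         y(i) = y(i-1) + x(i)
      end do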
4 Architectural Comparison
- Custom vector architectures have
  - High memory bandwidth relative to peak
  - Superior interconnect latency, point-to-point, and bisection bandwidth
- Overall, ES appears as the most balanced architecture, while Altix shows the best architectural balance among the superscalar architectures
- A key balance point for vector systems is the scalar:vector ratio
5 Applications studied
- LBMHD - Plasma Physics - 1,500 lines - grid based
  - Lattice Boltzmann approach for magneto-hydrodynamics
- CACTUS - Astrophysics - 100,000 lines - grid based
  - Solves Einstein's equations of general relativity
- PARATEC - Material Science - 50,000 lines - Fourier space/grid
  - Density Functional Theory electronic structure code
- GTC - Magnetic Fusion - 5,000 lines - particle based
  - Particle-in-cell method for the gyrokinetic Vlasov-Poisson equation
- Applications chosen with potential to run at ultrascale
  - Computations contain abundant data parallelism
  - ES runs require minimal parallelization and vectorization hurdles
  - Codes originally designed for superscalar systems
  - Ported onto a single node of the SX-6; first multi-node experiments performed at the ESC
6 Plasma Physics: LBMHD
- LBMHD uses a Lattice Boltzmann method to model magneto-hydrodynamics (MHD)
- Performs 2D simulation of high-temperature plasma
- Evolves from initial conditions, decaying to form current sheets
- 2D spatial grid is coupled to an octagonal streaming lattice
- Block distributed over a 2D processor grid

Current density decay of two cross-shaped structures

- Main computational components (see the schematic timestep below):
  - Collision: requires coefficients for the local gridpoint only, no communication
  - Stream: values at gridpoints are streamed to neighbors; at cell boundaries information is exchanged via MPI
  - Interpolation: step required between the spatial and streaming lattices
- Developed by George Vahala's group at the College of William and Mary; ported by Jonathan Carter
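A schematic of one LBMHD timestep, with hypothetical routine and array names (the actual code organization may differ):

      ! Schematic timestep loop (hypothetical names, not the actual source).
      do step = 1, nsteps
         call collision(f, feq)        ! local per-gridpoint update, no communication
         call exchange_boundary(f)     ! MPI_Isend/MPI_Irecv of cell-boundary values
         call stream(f)                ! shift distribution values to neighboring gridpoints
      end do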
7 LBMHD Porting Details
(left) octagonal streaming lattice coupled with
square spatial grid
(right) example of diagonal streaming vector
updating three spatial cells
- Collision routine rewritten
  - For the ES, loop ordering was switched so that the gridpoint loop (1000 iterations) is inner rather than the velocity or magnetic-field loops (10 iterations), as sketched below
  - The X1 compiler made this transformation automatically, multistreaming the outer loop and vectorizing (via strip mining) the inner loop
  - Temporary arrays padded to reduce bank conflicts
- Stream routine performs well
  - Array shift operations, block copies, 3rd-degree polynomial evaluation
- Boundary value exchange
  - MPI_Isend, MPI_Irecv pairs
  - Further work: plan to use ES "global memory" to remove message copies
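A hedged sketch of the loop interchange in the collision routine (array names, extents, and the update formula are illustrative, not the actual LBMHD source):

      ! Original ordering: the short velocity loop (~10 iterations) is innermost,
      ! so the vector length is only ~10.
      do i = 1, ngrid                      ! ~1000 gridpoints
         do l = 1, nvel                    ! ~10 streaming directions
            f(i,l) = f(i,l) - omega * (f(i,l) - feq(i,l))
         end do
      end do

      ! Interchanged ordering used on the ES: the long gridpoint loop becomes the
      ! inner (vector) loop; on the X1 the outer loop is multistreamed and the
      ! inner loop vectorized via strip mining.
      do l = 1, nvel
         do i = 1, ngrid
            f(i,l) = f(i,l) - omega * (f(i,l) - feq(i,l))
         end do
      end do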
8 LBMHD Performance
- ES achieves the highest performance to date: over 3.3 Tflop/s for P=1024
- X1 comparable in absolute speed up to P=64 (lower peak)
- But performs 1.5X slower at P=256 (decreased scalability)
- CAF improved X1 to slightly exceed ES at P=64 (up to 4.70 Gflop/s per processor)
- ES is 44X, 16X, and 7X faster than Power3, Power4, and Altix
- Low computational intensity (1.5) and high memory requirement (30 GB) hurt scalar performance
- Altix is the best of the scalar systems due to high memory bandwidth and a fast interconnect
9 LBMHD on X1: MPI vs. CAF
- X1 well suited for one-sided parallel languages (globally addressable memory)
- MPI hinders this feature and requires scalar tag matching
- CAF allows much simpler coding of the boundary exchange (array subscripting):
  - feq(ista-1,jsta:jend,1) = feq(iend,jsta:jend,1)[iprev,myrankj]
- MPI requires non-contiguous data copies into a buffer, unpacked at the destination (see the sketch below)
- Since communication is only about 10% of LBMHD runtime, only slight improvements
- However, for P=64 on the 4096^2 grid, performance degrades. Tradeoffs:
  - CAF reduced total message volume 3X (eliminates user and system buffer copies)
  - But CAF used more numerous and smaller messages
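For contrast, a sketch of the same boundary exchange with two-sided MPI (buffer names, neighbor ranks, and loop bounds are illustrative): the non-contiguous ghost column is packed into a contiguous buffer, sent, and unpacked at the destination.

      ! Illustrative MPI version of the boundary exchange (names hypothetical).
      do j = jsta, jend
         sendbuf(j-jsta+1) = feq(iend, j, 1)          ! pack non-contiguous column
      end do
      call MPI_Isend(sendbuf, jend-jsta+1, MPI_DOUBLE_PRECISION, inext, 0, comm, sreq, ierr)
      call MPI_Irecv(recvbuf, jend-jsta+1, MPI_DOUBLE_PRECISION, iprev, 0, comm, rreq, ierr)
      call MPI_Wait(sreq, MPI_STATUS_IGNORE, ierr)
      call MPI_Wait(rreq, MPI_STATUS_IGNORE, ierr)
      do j = jsta, jend
         feq(ista-1, j, 1) = recvbuf(j-jsta+1)        ! unpack into the ghost column
      end do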
10 Astrophysics: CACTUS
- Numerical solution of Einstein's equations from the theory of general relativity
- Among the most complex equations in physics: a set of coupled nonlinear hyperbolic and elliptic systems with thousands of terms
- CACTUS evolves these equations to simulate high gravitational fluxes, such as the collision of two black holes

Visualization of a grazing collision of two black holes

- Communication only at boundaries; expect high parallel efficiency
- Evolves PDEs on a regular grid using finite differences (see the stencil sketch below)
- Uses the ADM formulation: domain decomposed into 3D hypersurfaces for different slices of space along the time dimension
- Exciting new field about to be born: Gravitational Wave Astronomy - fundamentally new information about the Universe
- Gravitational waves: ripples in spacetime curvature, caused by matter motion, causing distances to change
- Developed at the Max Planck Institute, vectorized by John Shalf
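A minimal sketch of the regular-grid finite-difference loop structure CACTUS relies on (the actual evolution equations contain thousands of terms; this toy stencil, with illustrative names, only shows why the innermost x-dimension determines the vector length):

      ! Toy 3D stencil update (not actual CACTUS code). The inner loop runs over
      ! the x-dimension and sets the vector length; ghost zones at the faces are
      ! filled from neighboring processors before each sweep.
      do k = 2, nz-1
         do j = 2, ny-1
            do i = 2, nx-1
               unew(i,j,k) = u(i,j,k) + dt * ( u(i+1,j,k) + u(i-1,j,k)   &
                           + u(i,j+1,k) + u(i,j-1,k) + u(i,j,k+1)        &
                           + u(i,j,k-1) - 6.0d0*u(i,j,k) ) / h**2
            end do
         end do
      end do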
11 CACTUS Performance
- ES achieves the fastest performance to date: 45X faster than Power3!
- Vector performance related to the x-dimension (vector length)
- Excellent scaling on ES using fixed data size per processor (weak scaling)
- Scalar performance better on smaller problem sizes (cache effects)
- X1 surprisingly poor (4X slower than ES) - low scalar:vector ratio
- Unvectorized boundary conditions required 15% of runtime on ES and 30% on X1
  - < 5% for the scalar version; unvectorized code can quickly dominate cost
- Poor superscalar performance despite high computational intensity
  - Register spilling due to a large number of loop variables
  - Prefetch engines inhibited by multi-layer ghost-zone calculations
12 Material Science: PARATEC
- PARATEC performs first-principles quantum mechanical total energy calculations using pseudopotentials and a plane-wave basis set
- Density Functional Theory to calculate the structure and electronic properties of new materials
- DFT calculations are one of the largest consumers of supercomputer cycles in the world

Induced current and charge density in crystallized glycine

- Uses an all-band conjugate gradient (CG) approach to obtain the electron wavefunctions
- 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
- Part of the calculation in real space, the other in Fourier space
- Uses a specialized 3D FFT to transform the wavefunctions
- Computationally intensive - generally obtains a high percentage of peak
- Developed by Andrew Canning with Louie's and Cohen's groups (UC Berkeley, LBNL)
13 PARATEC: Wavefunction Transpose

Figure panels (a)-(f): data layout of the wavefunction at each stage of the parallel 3D FFT
- Transpose from Fourier to real space
- 3D FFT done via 3 sets of 1D FFTs and 2 transposes (schematic below)
- Most communication is in the global transpose, (b) to (c); little communication (d) to (e)
- Many FFTs are done at the same time to avoid latency issues
- Only non-zero elements are communicated/calculated
- Much faster than the vendor 3D FFT
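A schematic of the parallel 3D FFT structure described above (routine names are hypothetical; the actual PARATEC implementation also restricts the transposes to non-zero columns):

      ! Schematic parallel 3D FFT (hypothetical routine names, not PARATEC source).
      call fft_1d_batch(wf, n1)        ! 1D FFTs along the locally held dimension
      call global_transpose(wf, comm)  ! all-to-all: make the second dimension local
      call fft_1d_batch(wf, n2)
      call global_transpose(wf, comm)  ! second transpose; only non-zero columns moved
      call fft_1d_batch(wf, n3)        ! final set of 1D FFTs completes the 3D transform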
14 PARATEC Performance
- ES achieves the fastest performance to date! Over 2 Tflop/s on 1024 processors
- Main advantage for this type of code is the fast interconnect system
- X1 is 3.5X slower than ES (although its peak is 50% higher)
- Non-vectorizable code can be much more expensive on X1 (32:1 vs. 8:1)
- Lower bisection bandwidth to computation ratio
- Limited scalability due to the increasing cost of the global transpose and reduced vector length
- Plan to run a larger problem size on the next ES visit
- Scalar architectures generally perform well due to high computational intensity
- Power3, Power4, and Altix are 8X, 4X, and 1.5X slower than ES
- Vector architectures offer the opportunity to simulate systems not possible on scalar platforms
15 Magnetic Fusion: GTC
- Gyrokinetic Toroidal Code: transport of thermal energy (plasma microturbulence)
- Goal of magnetic fusion is a burning-plasma power plant producing cleaner energy
- GTC solves the 3D gyroaveraged gyrokinetic system with a particle-in-cell (PIC) approach
- PIC scales as N instead of N^2: particles interact with the electromagnetic field on a grid
- Allows solving the equations of particle motion as ODEs (instead of nonlinear PDEs)
- Main computational tasks (see the schematic timestep below):
  - Scatter: deposit particle charge to the nearest grid points
  - Solve: Poisson equation to get the potential at each grid point
  - Gather: calculate the force on each particle from the neighboring potential
  - Move: advance particles by solving the equations of motion
  - Shift: move particles that have left the local domain to the owning processor
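A schematic of how these tasks compose one PIC timestep (hypothetical routine names, not the actual GTC source):

      ! Schematic GTC timestep (hypothetical names).
      do istep = 1, nsteps
         call charge_deposit(particles, rho)   ! scatter charge onto the grid
         call poisson_solve(rho, phi)          ! potential at each grid point
         call gather_and_push(particles, phi)  ! force from neighboring points, advance particles
         call shift_particles(particles)       ! exchange particles that left the local domain
      end do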
3D visualization of electrostatic potential in
magnetic fusion device
Developed at Princeton Plasma Physics Laboratory,
vectorized by Stephane Ethier
16 GTC Scatter operation
- Particle charge is deposited amongst the nearest grid points
- Force is calculated from the neighboring potential, then the particle is moved accordingly
- Several particles can contribute to the same grid points, resulting in memory conflicts (dependencies) that prevent vectorization
- Solution: VLEN copies of the charge deposition array, with a reduction after the main loop (sketched below)
- However, this greatly increases the memory footprint (8X)
- Since particles are randomly localized, scatter also hinders cache reuse
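A hedged sketch of the charge-deposition workaround (array names, the index lookup, and the accumulation details are illustrative): each vector lane accumulates into its own private copy of the grid array, removing the dependency, and the copies are summed after the loop.

      ! Illustrative charge deposition with VLEN private copies (not actual GTC source).
      real(8) :: rho(ngrid), rho_tmp(ngrid, vlen)   ! one copy per vector lane: ~8X more memory
      rho = 0.0d0
      rho_tmp = 0.0d0
      do ip = 1, nparticles
         lane = mod(ip-1, vlen) + 1                 ! each lane writes its own copy,
         ig   = grid_index(ip)                      ! so no two lanes touch the same element
         rho_tmp(ig, lane) = rho_tmp(ig, lane) + charge(ip)   ! (grid_index is hypothetical)
      end do
      do lane = 1, vlen                             ! reduction of the private copies
         rho(:) = rho(:) + rho_tmp(:, lane)
      end do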
17 GTC Performance
- ES achieves the fastest performance of any tested architecture!
- First time the code achieved 20% of peak, compared with less than 10% on superscalar systems
- Hybrid (OpenMP) parallelism not possible on the vector version due to the increased memory requirements
- P=64 on ES is 1.6X faster than P=1024 on Power3!
- Reduced scalability due to decreasing vector length, not MPI performance
- Non-vectorizable code portions are expensive on X1
- Before vectorization, the shift routine accounted for 11% of overhead on ES and 54% on X1
- Larger tests could not be performed at ES due to parallelization/vectorization hurdles
- Currently developing a new version with increased particle decomposition
- Advantage of ES for PIC codes may reside in higher statistical-resolution simulations
- Greater speed allows more particles per cell
18 Overview
- Tremendous potential of vector architectures: 4 codes running faster than ever before
- Vector systems allow resolution not possible with scalar architectures (regardless of processor count)
- Opportunity to perform scientific runs at unprecedented scale
- ES shows high raw and much higher sustained performance compared with X1
- Limited X1-specific optimization - the optimal programming approach is still unclear (CAF, etc.)
- Non-vectorizable code segments become very expensive (8:1 or even 32:1 ratio)
- Evaluation codes contain sufficient regularity in computation for high vector performance
- GTC is an example of code at odds with data parallelism
- Much more difficult to evaluate codes poorly suited for vectorization
- Vectors potentially at odds with emerging techniques (irregular, multi-physics, multi-scale)
- Plan to expand the scope of application domains/methods, and to examine the latest HPC architectures
19 Second ES visit
- Evaluate high-concurrency PARATEC performance using a large-scale Quantum Dot simulation
- Evaluate CACTUS performance using updated vectorization of the radiation boundary condition
- Evaluate MADCAP performance using a newly optimized version, without global file system requirements and with improved I/O behavior
- Examine the 3D version of LBMHD and explore optimization strategies
- Evaluate GTC performance using updated vectorization of the shift routine as well as a new particle decomposition approach designed to increase concurrency
- Evaluate performance of FVCAM3 (Finite Volume atmospheric model) at high concurrencies and resolutions (1° x 1.25°, 0.5° x 0.625°, 0.25° x 0.375°)
- Papers available at http://crd.lbl.gov/oliker