1
Scientific Computations on Modern Parallel
Vector Systems
  • Leonid Oliker, Jonathan Carter, Andrew Canning,
    John Shalf
  • Lawrence Berkeley National Laboratory
  • Stephane Ethier
  • Princeton Plasma Physics Laboratory
  • http://crd.lbl.gov/oliker

2
Overview
  • Superscalar cache-based architectures dominate
    HPC market
  • Leading architectures are commodity-based SMPs
    due to generality and perception of cost
    effectiveness
  • Growing gap between peak and sustained performance
    is well known in scientific computing
  • Modern parallel vector systems may bridge this gap
    for many important applications
  • In April 2002, the Earth Simulator (ES) became
    operational. Peak ES performance > all DOE and
    DOD systems combined. Demonstrated high sustained
    performance on demanding scientific apps
  • Conducting evaluation study of scientific
    applications on modern vector systems
  • 09/2003 MOU between ES and NERSC was completed.
    First visit to the ES Center December 8th-17th,
    2003 (ES remote access not available). First
    international team to conduct a performance
    evaluation study at ES
  • Examining best mapping between demanding
    applications and leading HPC systems - one size
    does not fit all

3
Vector Paradigm
  • High memory bandwidth
  • Allows systems to effectively feed ALUs (high
    byte to flop ratio)
  • Flexible memory addressing modes
  • Supports fine grained strided and irregular data
    access
  • Vector Registers
  • Hide memory latency via deep pipelining of memory
    load/stores
  • Vector ISA
  • Single instruction specifies large number of
    identical operations
  • Vector architectures allow for
  • Reduced control complexity
  • Efficiently utilize large number of computational
    resources
  • Potential for automatic discovery of parallelism
  • However, most effective if sufficient regularity is
    discoverable in program structure
  • Suffers even if a small % of code is
    non-vectorizable (Amdahl's Law); a minimal loop
    sketch follows below
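Not part of the original slides: a minimal Fortran sketch of the vectorization point above, with illustrative array names. The first loop has independent iterations and vectorizes; the second carries a dependence on the previous element and typically runs in scalar mode, which is the Amdahl's Law effect noted in the last bullet.

```fortran
program vector_sketch
  implicit none
  integer, parameter :: n = 1024
  real(8) :: a(n), b(n), c(n)
  integer :: i

  call random_number(b)
  call random_number(c)

  ! Vectorizable: every iteration is independent, so one vector
  ! instruction can cover many identical operations
  do i = 1, n
     a(i) = b(i) + 2.0d0*c(i)
  end do

  ! Loop-carried dependence: each iteration needs the result of the
  ! previous one, so the compiler cannot vectorize it; this scalar
  ! fraction limits the overall speedup
  a(1) = b(1)
  do i = 2, n
     a(i) = a(i-1) + c(i)
  end do

  print *, sum(a)
end program vector_sketch
```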

4
Architectural Comparison
  • Custom vector architectures have
  • High memory bandwidth relative to peak
  • Superior interconnect: latency, point-to-point
    bandwidth, and bisection bandwidth
  • Overall ES appears as the most balanced
    architecture, while Altix shows best
    architectural balance among superscalar
    architectures
  • A key balance point for vector systems is the
    scalar:vector ratio

5
Applications studied
  • LBMHD: Plasma Physics, 1,500 lines, grid based
  • Lattice Boltzmann approach for
    magneto-hydrodynamics
  • CACTUS: Astrophysics, 100,000 lines, grid based
  • Solves Einstein's equations of general relativity
  • PARATEC: Material Science, 50,000 lines, Fourier
    space/grid
  • Density Functional Theory electronic structure
    codes
  • GTC: Magnetic Fusion, 5,000 lines, particle based
  • Particle-in-cell method for the gyrokinetic
    Vlasov-Poisson equation
  • Applications chosen with potential to run at
    ultrascale
  • Computations contain abundant data parallelism
  • ES runs present minimal parallelization and
    vectorization hurdles
  • Codes originally designed for superscalar systems
  • Ported onto single node of SX6, first multi-node
    experiments performed at ESC

6
Plasma Physics: LBMHD
  • LBMHD uses a Lattice Boltzmann method to model
    magneto-hydrodynamics (MHD)
  • Performs 2D simulation of high temperature plasma
  • Evolves from initial conditions, decaying to form
    current sheets
  • 2D spatial grid is coupled to octagonal streaming
    lattice
  • Block distributed over 2D processor grid

Current density decays of two cross-shaped
structures
  • Main computational components
  • Collision: requires coefficients for the local
    gridpoint only, no communication
  • Stream: values at gridpoints are streamed to
    neighbors; at cell boundaries information is
    exchanged via MPI
  • Interpolation step required between spatial and
    stream lattices
  • Developed by George Vahala's group at the College
    of William and Mary, ported by Jonathan Carter

7
LBMHD Porting Details
(left) octagonal streaming lattice coupled with
square spatial grid
(right) example of diagonal streaming vector
updating three spatial cells
  • Collision routine rewritten:
  • For ES, loop ordering switched so the gridpoint
    loop (1000 iterations) is innermost rather than
    the velocity or magnetic field loops (10
    iterations); see the sketch at the end of this
    slide
  • X1 compiler made this transformation
    automatically, multistreaming the outer loop and
    vectorizing (via strip mining) the inner loop
  • Temporary arrays padded to reduce bank conflicts
  • Stream routine performs well
  • Array shift operations, block copies, 3rd-degree
    polynomial eval
  • Boundary value exchange
  • MPI_Isend, MPI_Irecv pairs
  • Further work: plan to use ES "global memory" to
    remove message copies
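Not part of the original slides: a minimal Fortran sketch of the loop-ordering change described above. Array names (feq, rho, udotc, w) and the arithmetic are illustrative, not the actual LBMHD collision terms; the point is that the long gridpoint loop sits innermost, so the vector length is on the order of 1000 instead of 10.

```fortran
program collision_sketch
  implicit none
  integer, parameter :: ngrid = 1000, nvel = 10
  real(8) :: feq(ngrid,nvel), rho(ngrid), udotc(ngrid,nvel), w(nvel)
  integer :: i, k

  call random_number(rho)
  call random_number(udotc)
  call random_number(w)

  ! ES-friendly ordering: the short velocity loop is outermost and the
  ! long gridpoint loop is innermost, giving long vectors; it also runs
  ! over the contiguous first dimension of feq.  (On the X1 the compiler
  ! reportedly multistreamed the outer loop and strip-mined the inner.)
  do k = 1, nvel
     do i = 1, ngrid
        feq(i,k) = w(k)*rho(i)*(1.0d0 + 3.0d0*udotc(i,k))
     end do
  end do

  print *, sum(feq)
end program collision_sketch
```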

8
LBMHD Performance
  • ES achieves highest performance to date: over
    3.3 Tflop/s for P=1024
  • X1 comparable absolute speed up to P=64 (lower %
    of peak)
  • But performs 1.5X slower at P=256 (decreased
    scalability)
  • CAF improved X1 to slightly exceed ES at P=64 (up
    to 4.70 Gflop/P)
  • ES is 44X, 16X, and 7X faster than Power3,
    Power4, and Altix
  • Low CI (1.5) and high memory requirement (30GB)
    hurt scalar performance
  • Altix best scalar due to high memory bandwidth,
    fast interconnect

9
LBMHD on X1: MPI vs CAF
  • X1 well-suited for one-sided parallel languages
    (globally addressable mem)
  • MPI hinders this feature and requires scalar tag
    matching
  • CAF allows much simpler coding of boundary
    exchange (array subscripting); see the sketch at
    the end of this slide
  • feq(ista-1,jsta:jend,1) = feq(iend,jsta:jend,1)[iprev,myrankj]
  • MPI requires non-contiguous data copies into
    buffer, unpacked at destination
  • Since communication is about 10% of LBMHD
    runtime, only slight improvements
  • However, for P=64 on a 4096² grid, performance
    degrades. Tradeoffs:
  • CAF reduced total message volume 3X (eliminates
    user and system buffer copy)
  • But CAF used more numerous and smaller-sized
    messages
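Not part of the original slides: a minimal coarray Fortran sketch of the one-sided exchange idea, simplified to a 1D image layout (the fragment on this slide uses a 2D processor grid with cosubscripts [iprev,myrankj]); array names and sizes are illustrative.

```fortran
program caf_halo_sketch
  implicit none
  integer, parameter :: nx = 64, ny = 64, nvel = 9
  real(8) :: feq(0:nx+1, ny, nvel)[*]   ! coarray: globally addressable
  integer :: iprev

  feq = real(this_image(), 8)
  iprev = this_image() - 1
  if (iprev < 1) iprev = num_images()
  sync all

  ! One-sided pull of the neighbor's last interior column into the local
  ! ghost column: a plain array assignment with a cosubscript, with no
  ! tag matching and no packing/unpacking of MPI buffers
  feq(0, :, 1) = feq(nx, :, 1)[iprev]

  sync all
  print *, 'image', this_image(), 'ghost sum', sum(feq(0,:,1))
end program caf_halo_sketch
```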

10
Astrophysics: CACTUS
  • Numerical solution of Einstein's equations from
    the theory of general relativity
  • Among the most complex in physics: a set of
    coupled nonlinear hyperbolic and elliptic
    equations with thousands of terms
  • CACTUS evolves these equations to simulate high
    gravitational fluxes, such as collision of two
    black holes

Visualization of grazing collision of two black
holes
Communication at boundaries. Expect high parallel
efficiency.
  • Evolves PDEs on a regular grid using finite
    differences (a generic stencil sketch follows at
    the end of this slide)
  • Uses ADM formulation: domain decomposed into 3D
    hypersurfaces for different slices of space along
    the time dimension
  • Exciting new field about to be born: Gravitational
    Wave Astronomy, providing fundamentally new
    information about the Universe
  • Gravitational waves: ripples in spacetime
    curvature, caused by matter in motion, causing
    distances to change
  • Developed at Max Planck Institute, vectorized by
    John Shalf
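Not part of the original slides, and not the CACTUS/ADM equations: a generic Fortran sketch of an explicit finite-difference update on a regular 3D grid, the kind of stencil sweep the slide describes, with the innermost loop over the contiguous x dimension (the vector-length dependence noted on the next slide).

```fortran
program stencil_sketch
  implicit none
  integer, parameter :: n = 32
  real(8), parameter :: dt = 0.1d0, dx = 1.0d0
  real(8) :: u(n,n,n), unew(n,n,n)
  integer :: i, j, k

  call random_number(u)
  unew = u

  ! Interior update with a 7-point Laplacian stencil; the innermost i
  ! loop vectorizes and runs over the contiguous x dimension
  do k = 2, n-1
     do j = 2, n-1
        do i = 2, n-1
           unew(i,j,k) = u(i,j,k) + dt/dx**2 * &
                ( u(i+1,j,k) + u(i-1,j,k) + u(i,j+1,k) + u(i,j-1,k) &
                + u(i,j,k+1) + u(i,j,k-1) - 6.0d0*u(i,j,k) )
        end do
     end do
  end do

  print *, maxval(abs(unew - u))
end program stencil_sketch
```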

11
CACTUS Performance
  • ES achieves fastest performance to date: 45X
    faster than Power3!
  • Vector performance related to x-dim (vector
    length)
  • Excellent scaling on ES using fixed data size per
    proc (weak scaling)
  • Scalar performance better on smaller problem size
    (cache effects)
  • X1 surprisingly poor (4X slower than ES) - low
    scalar:vector ratio
  • Unvectorized boundary condition required 15% of
    runtime on ES and 30% on X1
  • < 5% for the scalar version; unvectorized code
    can quickly dominate cost
  • Poor superscalar performance despite high
    computational intensity
  • Register spilling due to large number of loop
    variables
  • Prefetch engines inhibited due to multi-layer
    ghost zones calculations

12
Material Science: PARATEC
  • PARATEC performs first-principles quantum
    mechanical total energy calculations using
    pseudopotentials and a plane-wave basis set
  • Density Functional Theory to calculate structure
    and electronic properties of new materials
  • DFT calculations are one of the largest consumers
    of supercomputer cycles in the world

Induced current and charge density in
crystallized glycine
  • Uses all-band CG approach to obtain wavefunction
    of electrons
  • 33% 3D FFT, 33% BLAS3, 33% hand-coded F90
  • Part of the calculation is in real space, the
    other in Fourier space
  • Uses specialized 3D FFT to transform wavefunction
  • Computationally intensive - generally obtains
    high percentage of peak
  • Developed by Andrew Canning with Louie's and
    Cohen's groups (UCB, LBNL)

13
PARATEC: Wavefunction Transpose
(figure: wavefunction transpose steps, panels (a)-(f))
  • Transpose from Fourier to real space
  • 3D FFT done via 3 sets of 1D FFTs and 2
    transposes (see the sketch at the end of this
    slide)
  • Most communication in global transpose, (b) to
    (c); little communication, (d) to (e)
  • Many FFTs done at the same time to avoid latency
    issues
  • Only non-zero elements communicated/calculated
  • Much faster than vendor 3D-FFT

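Not part of the original slides: a serial Fortran sketch of the "1D FFTs plus transposes" structure described above. The dft1d routine is a naive O(n²) placeholder standing in for a real 1D FFT, and the parallel transposes and non-zero-element pruning that PARATEC performs are only indicated in comments.

```fortran
program fft_transpose_sketch
  implicit none
  integer, parameter :: n = 8
  complex(8) :: u(n,n,n)
  integer :: j, k

  u = (1.0d0, 0.0d0)

  ! Step 1: 1D transforms along the first (contiguous) dimension ...
  do k = 1, n
     do j = 1, n
        call dft1d(u(:,j,k))
     end do
  end do
  ! ... Step 2: a transpose brings the next dimension into place, then
  ! another set of 1D transforms, and the pattern repeats once more.  In
  ! the parallel code the transposes between distributed data layouts
  ! (panels (b) to (c) above) carry most of the communication.

  print *, u(1,1,1)

contains

  ! Naive discrete Fourier transform used only to keep the sketch
  ! self-contained; a production code would call an optimized 1D FFT
  subroutine dft1d(x)
    complex(8), intent(inout) :: x(:)
    complex(8) :: y(size(x))
    real(8), parameter :: pi = 3.141592653589793d0
    integer :: p, q, m
    m = size(x)
    do p = 1, m
       y(p) = (0.0d0, 0.0d0)
       do q = 1, m
          y(p) = y(p) + x(q)*exp(cmplx(0.0d0, -2.0d0*pi*(p-1)*(q-1)/m, 8))
       end do
    end do
    x = y
  end subroutine dft1d

end program fft_transpose_sketch
```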
14
PARATEC Performance
  • ES achieves fastest performance to date! Over
    2 Tflop/s on 1024 procs
  • Main advantage for this type of code is fast
    interconnect system
  • X1 3.5X slower than ES (although its peak is 50%
    higher)
  • Non-vectorizable code can be much more expensive
    on X1 (32:1 vs 8:1)
  • Lower bisection bandwidth to computation ratio
  • Limited scalability due to increasing cost of
    global transpose and reduced vector length
  • Plan to run larger problem size next ES visit
  • Scalar architectures generally perform well due
    to high computational intensity
  • Power3, Power4, and Altix are 8X, 4X, and 1.5X
    slower than ES
  • Vector architectures allow the opportunity to
    simulate systems not possible on scalar platforms

15
Magnetic Fusion: GTC
  • Gyrokinetic Toroidal Code: transport of thermal
    energy (plasma microturbulence)
  • Goal of magnetic fusion is a burning-plasma power
    plant producing cleaner energy
  • GTC solves 3D gyroaveraged gyrokinetic system w/
    particle-in-cell approach (PIC)
  • PIC scales as N instead of N²: particles interact
    with the electromagnetic field on a grid
  • Allows solving equation of particle motion with
    ODEs (instead of nonlinear PDEs)
  • Main computational tasks
  • Scatter: deposit particle charge to nearest grid
    points
  • Solve: Poisson eqn to get potential at each point
  • Gather: calc force based on neighbors' potential
  • Move: particles by solving eqn of motion
  • Shift: particles moved outside local domain

3D visualization of electrostatic potential in
magnetic fusion device
Developed at Princeton Plasma Physics Laboratory,
vectorized by Stephane Ethier
16
GTC Scatter operation
  • Particle charge deposited amongst nearest grid
    points.
  • Calculate force based on neighbors' potential,
    then move particle accordingly
  • Several particles can contribute to same grid
    points, resulting in memory conflicts
    (dependencies) that prevent vectorization
  • Solution: VLEN copies of the charge deposition
    array, with a reduction after the main loop (see
    the sketch at the end of this slide)
  • However, greatly increases memory footprint (8X)
  • Since particles are randomly localized, the
    scatter also hinders cache reuse
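Not part of the original slides: a minimal Fortran sketch of the duplicated-accumulator idea described above; array names, sizes, and the unit-charge deposition are illustrative, not the actual GTC scatter. Within any strip of VLEN consecutive particles each update lands in a distinct copy of the grid array, so no two updates in the same vector instruction touch the same element; the copies are reduced afterwards, at the memory cost noted on the slide.

```fortran
program scatter_sketch
  implicit none
  integer, parameter :: ngrid = 1000, npart = 100000, vlen = 256
  real(8) :: charge(ngrid), charge_v(ngrid, vlen), x(npart)
  integer :: ip, ig, lane

  call random_number(x)
  charge_v = 0.0d0

  ! Each vector "lane" accumulates into its own private copy of the grid
  ! array, removing the dependence caused by particles that hit the same
  ! gridpoint within one vector strip
  do ip = 1, npart
     ig   = 1 + int(x(ip)*(ngrid-1))
     lane = 1 + mod(ip-1, vlen)
     charge_v(ig, lane) = charge_v(ig, lane) + 1.0d0
  end do

  ! Reduction over the VLEN copies after the main loop
  charge = sum(charge_v, dim=2)
  print *, sum(charge)
end program scatter_sketch
```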

17
GTC Performance
  • ES achieves fastest performance of any tested
    architecture!
  • First time the code achieved 20% of peak -
    compared with less than 10% on superscalar
    systems
  • Vector hybrid (OpenMP) parallelism not possible
    due to increased memory requirements
  • P=64 on ES is 1.6X faster than P=1024 on Power3!
  • Reduced scalability due to decreasing vector
    length, not MPI performance
  • Non-vectorizable code portions expensive on X1
  • Before vectorization, the shift routine accounted
    for 11% of ES and 54% of X1 overhead
  • Larger tests could not be performed at ES due to
    parallelization/vectorization hurdles
  • Currently developing new version with increased
    particle decomposition
  • Advantage of ES for PIC codes may reside in
    higher statistical resolution simulations
  • Greater speed allows more particles per cell

18
Overview
  • Tremendous potential of vector architectures: 4
    codes running faster than ever before
  • Vector systems allow resolution not possible with
    scalar architectures (regardless of number of
    procs)
  • Opportunity to perform scientific runs at
    unprecedented scale
  • ES shows high raw and much higher sustained
    performance compared with X1
  • Limited X1-specific optimization - optimal
    programming approach still unclear (CAF, etc.)
  • Non-vectorizable code segments become very
    expensive (8:1 or even 32:1 ratio)
  • Evaluation codes contain sufficient regularity in
    computation for high vector performance
  • GTC is an example of code at odds with data
    parallelism
  • Much more difficult to evaluate codes poorly
    suited for vectorization
  • Vectors potentially at odds w/ emerging
    techniques (irregular, multi-physics,
    multi-scale)
  • Plan to expand scope of application
    domains/methods, and examine latest HPC
    architectures

19
Second ES visit
  • Evaluate high-concurrency PARATEC performance
    using large-scale Quantum Dot simulation
  • Evaluate CACTUS performance using updated
    vectorization of radiation boundary condition
  • Evaluate MADCAP performance using a newly
    optimized version without global file system
    requirements and with improved I/O behavior
  • Examine 3D version of LBMHD, and explore
    optimization strategies
  • Evaluate GTC performance using updated
    vectorization of shift routine as well as new
    particle decomposition approach designed to
    increase concurrency
  • Evaluate performance of FVCAM3 (Finite Volume
    atmospheric model) at high concurrencies and
    resolutions (1° x 1.25°, 0.5° x 0.625°,
    0.25° x 0.375°)
  • Papers available at http://crd.lbl.gov/oliker