Application Performance Analysis on Blue Gene/L - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: Application Performance Analysis on Blue Gene/L


1
Application Performance Analysis on Blue Gene/L
  • Jim Pool, P.I.
  • Maciej Brodowicz, Sharon Brunett,
  • Tom Gottschalk, Dan Meiron,
  • Paul Springer, Thomas Sterling,
  • Ed Upchurch

2
Caltech's Role in the Blue Gene/L Project
  • Understand implications of BG/L network
    architecture and drive results from real-world
    ASCI applications
  • Develop statistical models of applications,
    processors as message generators (see the sketch
    after this list), and the network
  • Focus on
  • Application communications distribution
  • Network contention as a function of load, size,
    and adaptive routing
  • Represent 64K Nodes Explicitly in Statistical
    Model
  • Create trace analysis tools to characterize
    applications
  • Extensible Trace Facility (ETF)
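As a purely illustrative sketch of the "processors as message generators" idea above (not code from the Caltech model): each node can be represented as a source that emits messages with exponentially distributed inter-arrival times and destinations drawn from a measured communication distribution. The rate, message size, node count, and uniform traffic table below are placeholder assumptions.

    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    /* Hypothetical sketch of "processors as message generators": messages
     * with exponential inter-arrival times and destinations drawn from a
     * traffic CDF.  Rate, size, and uniform traffic are placeholders.     */

    static double urand(void) {            /* uniform sample in (0, 1) */
        return (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    }

    static double next_interval(double rate) {   /* exponential inter-arrival */
        return -log(urand()) / rate;
    }

    static int pick_dest(const double *cdf, int nodes) {
        double u = urand();
        for (int d = 0; d < nodes; d++)
            if (u <= cdf[d]) return d;
        return nodes - 1;
    }

    int main(void) {
        enum { NODES = 64 };             /* scaled-down stand-in for 64K nodes */
        double cdf[NODES];
        for (int d = 0; d < NODES; d++)  /* uniform destination distribution   */
            cdf[d] = (d + 1) / (double)NODES;

        double t = 0.0;
        for (int i = 0; i < 10; i++) {   /* emit a few messages from node 0    */
            t += next_interval(1000.0);  /* illustrative rate: 1000 msgs/s     */
            printf("t=%.6f s  0 -> %d  1024 bytes\n", t, pick_dest(cdf, NODES));
        }
        return 0;
    }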

3
Blue Gene/L Node
4
Blue Gene/L Network
5
ETF Built-in Trace Options
  • MPI events
  • All point-to-point communications (MPI-1)
  • All collective communications (MPI-1)
  • Non-blocking request tracking
  • Communicator creation and destruction
  • MPI datatype decoding (requires MPI-2)
  • Languages: C and Fortran
  • Easy instrumentation of applications (see the
    PMPI-style sketch after this list)
  • Memory reference and program execution tracing
  • Tracking of statically and dynamically allocated
    arrays (identifiers, element sizes, dimensions)
  • Tracking of scalar variables
  • Read and write accesses to individual scalars and
    array elements as well as contiguous vectors of
    elements
  • Function calls
  • Program execution phases
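The ETF API itself is not shown on the slides; as an illustration only (not ETF's actual implementation), MPI event tracing of the kind listed above is typically built on the standard MPI profiling interface (PMPI), e.g. a wrapper that logs each point-to-point call before forwarding it to the real routine:

    #include <mpi.h>
    #include <stdio.h>

    /* Illustrative PMPI-style wrapper, not ETF's actual code: intercept
     * MPI_Send, log the event with a timestamp, then call the real send. */
    int MPI_Send(const void *buf, int count, MPI_Datatype datatype,
                 int dest, int tag, MPI_Comm comm)
    {
        int rank, type_size;
        MPI_Comm_rank(comm, &rank);
        MPI_Type_size(datatype, &type_size);
        fprintf(stderr, "TRACE %.6f send rank=%d dest=%d tag=%d bytes=%d\n",
                MPI_Wtime(), rank, dest, tag, count * type_size);
        return PMPI_Send(buf, count, datatype, dest, tag, comm);
    }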

6
ETF Tracing Example for Magnetohydrodynamic (MHD)
Code with Adaptive Mesh Refinement (AMR)
  • Parallel MHD fluid code solves the equations of
    hydrodynamics and the resistive Maxwell's equations
  • Part of larger application which computes dynamic
    responses to strong shock waves impinging on
    target materials
  • Fortran 90 and MPI
  • MPI Cartesian communicators
  • Nearest-neighbor communications use non-blocking
    send/recv (sketched below)
  • MPI_Allreduce for calculating stable time steps
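A minimal sketch of the communication pattern just described, assuming a Cartesian communicator created elsewhere with MPI_Cart_create; buffer sizes and the local time-step value are placeholders, not taken from the MHD code:

    #include <mpi.h>

    /* Sketch of the slide's pattern: non-blocking nearest-neighbor exchange
     * along one dimension of a Cartesian communicator, followed by an
     * MPI_Allreduce that picks the globally stable (minimum) time step.    */
    void exchange_and_reduce(MPI_Comm cart, double *send_lo, double *send_hi,
                             double *recv_lo, double *recv_hi, int n,
                             double dt_local, double *dt_global)
    {
        MPI_Request req[4];
        int lo, hi;   /* lower and upper neighbor ranks (or MPI_PROC_NULL) */

        MPI_Cart_shift(cart, 0, 1, &lo, &hi);

        MPI_Irecv(recv_lo, n, MPI_DOUBLE, lo, 0, cart, &req[0]);
        MPI_Irecv(recv_hi, n, MPI_DOUBLE, hi, 1, cart, &req[1]);
        MPI_Isend(send_hi, n, MPI_DOUBLE, hi, 0, cart, &req[2]);
        MPI_Isend(send_lo, n, MPI_DOUBLE, lo, 1, cart, &req[3]);
        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* Stable global time step = minimum of every rank's local estimate */
        MPI_Allreduce(&dt_local, dt_global, 1, MPI_DOUBLE, MPI_MIN, cart);
    }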

7
AMR MHD Communication Profile
  • 20 time steps on 32 processors, 128x128 cells

[Communication profiles shown for maximum refinement level 1 and maximum refinement level 2]
8
Lennard-Jones Molecular Dynamics
  • Short range molecular dynamics application
    simulating Newtonian interactions in large groups
    of atoms
  • Production code from Sandia National Laboratories
  • Simulations are large in two dimensions
  • number of atoms and number of time steps
  • Spatial decomposition case selected
  • each processing node keeps track of the positions
    and movement of the atoms in a 3-D box
  • Computations carried out in a single time step
    correspond to femtoseconds of real time
  • a meaningful simulation of the evolution of the
    system's state typically requires thousands of
    time steps
  • Point-to-point MPI messages are exchanged across
    each of the six sides of the box per time step
    (see the sketch after this list)
  • Code is written in Fortran and MPI
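The production code is Fortran; purely as an illustration in C, the per-time-step exchange across the six faces of each node's box described above could look like the following (buffer management and a 3-D Cartesian communicator are assumed, with uniform boxes so send and receive counts match per face):

    #include <mpi.h>

    /* Illustrative per-time-step halo exchange over the six faces of the
     * spatial-decomposition box; cart is a 3-D Cartesian communicator.   */
    void halo_exchange_6_faces(MPI_Comm cart, double *sendbuf[6],
                               double *recvbuf[6], const int count[6])
    {
        for (int dim = 0; dim < 3; dim++) {
            int lo, hi;
            MPI_Cart_shift(cart, dim, 1, &lo, &hi);

            /* Face toward the upper neighbor, received from the lower one */
            MPI_Sendrecv(sendbuf[2*dim], count[2*dim], MPI_DOUBLE, hi, dim,
                         recvbuf[2*dim], count[2*dim], MPI_DOUBLE, lo, dim,
                         cart, MPI_STATUS_IGNORE);
            /* Face toward the lower neighbor, received from the upper one */
            MPI_Sendrecv(sendbuf[2*dim+1], count[2*dim+1], MPI_DOUBLE, lo, 3+dim,
                         recvbuf[2*dim+1], count[2*dim+1], MPI_DOUBLE, hi, 3+dim,
                         cart, MPI_STATUS_IGNORE);
        }
    }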

9
Lennard-Jones Molecular Dynamics
[Figures: communication steps; typical grid cell and cutoff radius; computational cycle model]
10
LJS Single Processor BG/L Performance
Original code vs. code tuned for BG/L
[Chart: improvement (%) vs. number of atoms per BG/L CPU (15,625; 31,250; 62,500; 125,000; 250,000; 500,000), with gains attributed to good cache reuse]
11
LJS Molecular Dynamics Performance
Fixed Problem Size of 1 Billion Atoms
[Chart: time per single iteration (ms), split into compute time and communications time, for 2K to 64K BG/L CPUs]
12
LJS Speedup BG/L vs. ASCI Red 3200 Nodes
1 Billion Atom Problem
[Chart: speedup relative to ASCI Red vs. number of BlueGene/L nodes (2K to 64K)]
13
LJS Communications Time
500,000 Atoms per BG/L Node
[Chart: communications time per iteration (msecs) for physical nearest-neighbor mapping vs. random mapping on 4x4x4 (64), 8x8x8 (512), and 16x16x16 (4,096) node BG/L configurations]
14
What is QMC and Why is it a Good Fit for BG/L?
  • QMC is a finite all-electron Quantum Monte Carlo
    code used to determine quantum properties of
    materials with extremely high accuracy
  • Developed at Caltech by Bill Goddard's ASCI
    Material Properties group
  • Interesting Characteristics
  • Low memory requirements
  • After initialization, highly parallel and
    scalable
  • Minimal set of MPI calls required
  • Non-blocking point-to-point, reduction, probe,
    communicator, and collective calls
  • No communications during QMC working steps
  • Convergence statistics communicated total only
    7,200 bytes regardless of problem size and node count
  • Code already ported to many platforms (Linux,
    AIX, IRIX, etc.)
  • C and MPI sources

15
Iterative QMC Algorithm
For each processor do:
    Steps = Total Steps / number of processors
    Generate walkers
    Equilibrate walkers
    For each step:
        Generate QMC statistics
        Send QMC statistics to master node
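A minimal sketch of the statistics-reporting step, assuming the 7,200 bytes of convergence statistics form a fixed array of 900 doubles that can be combined by summation on the master node; the names and layout are assumptions, not taken from the QMC source:

    #include <mpi.h>

    #define STAT_DOUBLES 900   /* 7,200 bytes of statistics, assumed layout */

    /* Every node holds local QMC statistics; the master (rank 0) needs the
     * combined totals.  A single MPI_Reduce keeps the per-node message size
     * fixed at 7,200 bytes and maps well onto the BG/L torus.              */
    void report_statistics(double local[STAT_DOUBLES], MPI_Comm comm)
    {
        double total[STAT_DOUBLES];
        int rank;

        MPI_Reduce(local, total, STAT_DOUBLES, MPI_DOUBLE, MPI_SUM, 0, comm);
        MPI_Comm_rank(comm, &rank);
        if (rank == 0) {
            /* master node: total[] now holds the accumulated statistics */
        }
    }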

16
QMC Communications Time
For 100,000 Steps Per Node
(Reduce Using the Torus)
[Chart: time (seconds, log scale) vs. BG/L configuration, from 8x8x8 (512 nodes) to 64x32x32 (64K nodes)]
17
Future Application Porting and Analysis for BG/L
  • ASCI solid dynamics code simulating the
    mechanical response of polycrystalline materials,
    such as tantalum
  • Address memory constraints, grain load imbalance
    and MPI_Waitall() efficiency as we port/tune to
    BG/L
  • Good stress test for BG/L robustness
  • Scalable simulation of polycrystalline response
    with assumed grain shape. The grain shape is the
    space-filling polyhedron corresponding to the
    Wigner-Seitz cell of a BCC crystal. The 390-grain
    example shown here was run on LLNL's IBM SP3,
    Frost.