Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks

Transcript and Presenter's Notes

1
Petascale Programming with Virtual
Processors: Charm++, AMPI, and domain-specific
frameworks
  • Laxmikant Kale
  • http://charm.cs.uiuc.edu
  • Parallel Programming Laboratory
  • Dept. of Computer Science
  • University of Illinois at Urbana-Champaign

2
Outline
  • Challenges and opportunities: character of the
    new machines
  • Charm++ and AMPI
  • Basics
  • Capabilities
  • Programming techniques
  • Dice them fine
  • VPs to the rescue
  • Juggling for overlap
  • Load balancing
  • Scenarios and strategies
  • Case studies
  • Classical Molecular Dynamics
  • Car-Parrinello ab initio MD / Quantum Chemistry
  • Rocket Simulation
  • Raising the level of abstraction
  • Higher-level, compiler-supported notations
  • Domain-specific frameworks
  • Example:
  • Unstructured mesh (FEM) framework

3
Machines: current, planned, and future
  • Current
  • Lemieux: 3000 processors, 750 nodes,
    full-bandwidth fat-tree network
  • ASCI Q: similar architecture
  • System X: Infiniband
  • Tungsten: Myrinet
  • Thunder
  • Earth Simulator
  • Planned
  • IBM's Blue Gene/L: 65k nodes, 3D-torus topology
  • Red Storm (10k procs)
  • Future?
  • BG/L is an example
  • 1M processors!
  • 0.5 MB per processor
  • HPCS: 3 architectural plans

4
Some Trends: Communication
  • Bisection bandwidth
  • Can't scale as well with the number of processors
  • without being expensive
  • Wire-length delays
  • even on Lemieux, messages going through the highest
    level switches take longer
  • Two possibilities
  • Grid topologies, with near-neighbor connections
  • High link speed, low bisection bandwidth
  • Expensive, full-bandwidth networks

5
Trends: Memory
  • Memory latencies are ~100 times the processor
    cycle time!
  • This will get worse
  • A solution: put more processors in,
  • to increase bandwidth between processors and
    memory
  • On-chip DRAM
  • In other words: a low memory-to-processor ratio
  • But this can be handled with programming style
  • Application viewpoint, for physical modeling:
  • Given a fixed amount of run-time (4 hours or 10
    days)
  • Doubling spatial resolution
  • increases CPU needs more than 2-fold (smaller
    time-steps)

6
Application Complexity is Increasing
  • Why?
  • With more FLOPS, we need better algorithms
  • Not enough to just do more of the same
  • Example: dendritic growth in materials
  • Better algorithms lead to complex structure
  • Example: gravitational force calculation
  • Direct all-pairs: O(N²), but easy to parallelize
  • Barnes-Hut: O(N log N), but more complex
  • Multiple modules, dual time-stepping
  • Adaptive and dynamic refinements
  • Ambitious projects
  • Projects with new objectives lead to dynamic
    behavior and multiple components

7
Specific Programming Challenges
  • Explicit management of resources
  • This data on that processor
  • This work on that processor
  • Analogy: memory management
  • We declare arrays, and malloc dynamic memory
    chunks as needed
  • We do not specify memory addresses
  • As usual, indirection is the key
  • Programmer:
  • This data, partitioned into these pieces
  • This work, divided that way
  • System: maps data and work to processors

8
Virtualization Object-based Parallelization
  • Idea Divide the computation into a large number
    of objects
  • Let the system map objects to processors

The user is only concerned with interactions between
objects.
[Figure: user view (a collection of interacting objects) vs. the system implementation, which maps those objects onto processors]
9
Virtualization: Charm++ and AMPI
  • These systems seek an optimal division of labor
    between the system and the programmer
  • Decomposition done by the programmer,
  • Everything else automated

[Figure: division of labor (decomposition, mapping, scheduling, expression) between programmer and system for HPF, Charm++, and MPI, arranged along an abstraction/specialization spectrum]
10
Charm++ and Adaptive MPI
  • Charm++: Parallel C++
  • Asynchronous methods
  • Object arrays
  • In development for over a decade
  • Basis of several parallel applications
  • Runs on all popular parallel machines and
    clusters
  • AMPI: a migration path for legacy MPI codes
  • Gives them the dynamic load balancing capabilities
    of Charm++
  • Uses Charm++ object arrays
  • Minimal modifications to convert existing MPI
    programs
  • Automated via AMPIzer
  • Collaboration with David Padua
  • Bindings for
  • C, C++, and Fortran90

Both available from http://charm.cs.uiuc.edu
11
Protein Folding
Molecular Dynamics
Quantum Chemistry (QM/MM)
Computational Cosmology
Parallel Objects, Adaptive Runtime System
Libraries and Tools
Crack Propagation
Space-time meshes
Dendritic Growth
Rocket Simulation
The enabling CS technology of parallel objects
and intelligent Runtime systems has led to
several collaborative applications in CSE
12
Message From This Talk
  • Virtualization is ready and powerful enough to meet
    the needs of tomorrow's applications and machines
  • Virtualization and the associated techniques that we
    have been exploring for the past decade are ready
    and powerful enough to meet the needs of high-end
    parallel computing and complex and dynamic
    applications
  • These techniques are embodied in
  • Charm++
  • AMPI
  • Frameworks (Structured Grids, Unstructured Grids,
    Particles)
  • Virtualization of other coordination languages
    (UPC, GA, ..)

13
Acknowledgements
  • Graduate students including
  • Gengbin Zheng
  • Orion Lawlor
  • Milind Bhandarkar
  • Terry Wilmarth
  • Sameer Kumar
  • Jay deSouza
  • Chao Huang
  • Chee Wai Lee
  • Recent Funding
  • NSF (NGS Frederica Darema)
  • DOE (ASCI Rocket Center)
  • NIH (Molecular Dynamics)

14
Charm++ Object Arrays
  • A collection of data-driven objects (aka chares),
  • With a single global name for the collection, and
  • Each member addressed by an index
  • Mapping of element objects to processors handled
    by the system

[Figure: user's view, array elements A[0], A[1], A[2], A[3], ...]
15
Charm++ Object Arrays
  • A collection of chares,
  • with a single global name for the collection, and
  • each member addressed by an index
  • Mapping of element objects to processors handled
    by the system

[Figure: user's view of array elements A[0], A[1], A[2], A[3], ... vs. the system view, in which elements such as A[0] and A[3] are placed on particular processors]
16
Chare Arrays
  • Elements are data-driven objects
  • Elements are indexed by a user-defined data
    type-- sparse 1D, 2D, 3D, tree, ...
  • Send messages to index, receive messages at
    element. Reductions and broadcasts across the
    array
  • Dynamic insertion, deletion, migration-- and
    everything still has to work!
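A minimal sketch (not taken from the slides) of what a 1D chare array element looks like in Charm++ source, illustrating the features listed above; the module name "hello", the class Hello, and the method greet are illustrative assumptions:

    // Assumed .ci declaration:  array [1D] Hello { entry Hello(); entry void greet(int from); };
    #include "hello.decl.h"            // generated from the .ci file

    class Hello : public CBase_Hello {
    public:
      Hello() {}
      Hello(CkMigrateMessage *m) {}    // needed so elements can migrate
      void greet(int from) {           // runs when a message arrives for this index
        CkPrintf("element %d greeted by %d\n", thisIndex, from);
        int one = 1;                   // contribute to a sum reduction over the whole array
        contribute(sizeof(int), &one, CkReduction::sum_int);
      }
    };

    // Elsewhere (e.g. in the main chare):
    //   CProxy_Hello arr = CProxy_Hello::ckNew(16);  // dynamic creation of 16 elements
    //   arr[3].greet(0);                             // send to one index
    //   arr.greet(0);                                // broadcast to every element
    #include "hello.def.h"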

17
Charm++ Remote Method Calls
  • To call a method on a remote C++ object foo, use
    the local proxy C++ object CProxy_foo generated
    from the interface file

f[i].bar(3)   // method and parameters
  • This results in a network message, and eventually
    in a call to the real object's method

In another .C file:
void foo::bar(int x) { ... }
18
Charm++ Startup Process: Main

Interface (.ci) file:
module myModule {
  array [1D] foo {
    entry foo(int problemNo);
    entry void bar(int x);
  };
  mainchare myMain {                 // special startup object
    entry myMain(int argc, char **argv);
  };
};

In a .C file:
#include "myModule.decl.h"           // declares the generated class CBase_myMain
class myMain : public CBase_myMain {
  myMain(int argc, char **argv) {    // called at startup
    int nElements = 7, i = nElements/2;
    CProxy_foo f = CProxy_foo::ckNew(2, nElements);
    f[i].bar(3);
  }
};
#include "myModule.def.h"
19
Other Features
  • Broadcasts and Reductions
  • Runtime creation and deletion
  • nD and sparse array indexing
  • Library support (modules)
  • Groups: per-processor objects
  • Node Groups: per-node objects
  • Priorities: control message ordering

20
AMPI: Adaptive MPI
  • MPI interface, for C and Fortran, implemented on
    Charm++
  • Multiple virtual processors per physical
    processor
  • Implemented as user-level threads
  • Very fast context switching: ~1 µs
  • E.g., MPI_Recv only blocks the virtual processor, not
    the physical one
  • Supports migration (and hence load balancing) via
    extensions to MPI

21
AMPI
22
AMPI
Implemented as virtual processors (user-level
migratable threads)
23
How to Write an AMPI Program
  • Write your normal MPI program, and then
  • Link and run with Charm++
  • Compile and link with charmc
  • charmc -o hello hello.c -language ampi
  • charmc -o hello2 hello.f90 -language ampif
  • Run with charmrun
  • charmrun hello
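As a concrete, hedged example: any plain MPI program already is an AMPI program. The following hello-world sketch could be built and launched with the commands above (file name and output text are my own choices):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, size;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);   // under AMPI, each rank is a virtual processor
      MPI_Comm_size(MPI_COMM_WORLD, &size);   // size = number of virtual processors
      printf("hello from rank %d of %d\n", rank, size);
      MPI_Finalize();
      return 0;
    }

One plausible invocation, using the options described on the next slide, would be along the lines of: charmrun ./hello +p4 +vp16, i.e. 16 virtual MPI processes running on 4 physical processors (option spellings as given in these slides).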

24
How to Run an AMPI program
  • Charmrun
  • A portable parallel job execution script
  • Specify the number of physical processors: +p N
  • Specify the number of virtual MPI processes: +vp N
  • Special nodelist file for net- versions

25
AMPI: MPI Extensions
  • Process Migration
  • Asynchronous Collectives
  • Checkpoint/Restart
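A hedged sketch of how the process-migration extension is typically used: the application periodically makes a collective call that gives the runtime a chance to migrate virtual processors. AMPI of this era spells the call MPI_Migrate() (newer releases use AMPI_Migrate()); the loop structure and the LB_PERIOD knob are illustrative assumptions, not AMPI requirements:

    #include <mpi.h>

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      const int LB_PERIOD = 50;          // assumed tuning parameter
      for (int step = 0; step < 1000; ++step) {
        // ... one timestep of computation and MPI communication ...
        if (step % LB_PERIOD == 0)
          MPI_Migrate();                 // collective: runtime may move this VP to another PE
      }
      MPI_Finalize();
      return 0;
    }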

26
How to Migrate a Virtual Processor?
  • Move all application state to new processor
  • Stack Data
  • Subroutine variables and calls
  • Managed by compiler
  • Heap Data
  • Allocated with malloc/free
  • Managed by user
  • Global Variables

27
Stack Data
  • The stack is used by the compiler to track
    function calls and provide temporary storage
  • Local Variables
  • Subroutine Parameters
  • C alloca storage
  • Most of the variables in a typical application
    are stack data

28
Migrate Stack Data
  • Without compiler support, we cannot change the
    stack's address
  • Because we can't change the stack's interior pointers
    (return frame pointer, function arguments, etc.)
  • Solution: isomalloc addresses
  • Reserve address space on every processor for
    every thread's stack
  • Use mmap to scatter stacks in virtual memory
    efficiently
  • Idea comes from PM2

29
Migrate Stack Data
[Figure: memory layout of Processor A and Processor B from 0x00000000 to 0xFFFFFFFF (code, globals, heap, and the stacks of threads 1-4); thread 3's stack migrates from A to B into the same reserved virtual-address range]
31
Migrate Stack Data
  • Isomalloc is a completely automatic solution
  • No changes needed in application or compilers
  • Just like a software shared-memory system, but
    with proactive paging
  • But has a few limitations
  • Depends on having large quantities of virtual
    address space (best on 64-bit)
  • 32-bit machines can only have a few gigs of
    isomalloc stacks across the whole machine
  • Depends on unportable mmap
  • Which addresses are safe? (We must guess!)
  • What about Windows? Blue Gene?

32
Heap Data
  • Heap data is any dynamically allocated data
  • C malloc and free
  • C new and delete
  • F90 ALLOCATE and DEALLOCATE
  • Arrays and linked data structures are almost
    always heap data

33
Migrate Heap Data
  • Automatic solution: isomalloc all heap data, just
    like stacks!
  • -memory isomalloc link option
  • Overrides malloc/free
  • No new application code needed
  • Same limitations as isomalloc stacks
  • Manual solution: the application moves its heap data
  • Need to be able to size the message buffer, pack data
    into the message, and unpack it on the other side
  • The pup abstraction does all three (see the sketch
    below)
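A minimal sketch of a pup routine for an object with heap data, using the Charm++ PUP framework; the Particles class and its fields are illustrative assumptions:

    #include "pup.h"     // Charm++ PUP framework

    class Particles {
      int n;             // number of particles
      double *coords;    // heap array of length 3*n
    public:
      // One routine serves sizing, packing, and unpacking:
      void pup(PUP::er &p) {
        p | n;                                   // sizes and scalars first
        if (p.isUnpacking()) coords = new double[3 * n];
        PUParray(p, coords, 3 * n);              // size/pack/unpack the heap array
      }
    };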

34
Comparison with Native MPI
  • Performance
  • Slightly worse without optimization
  • Being improved
  • Flexibility
  • Runs even when only a small or awkward number of
    PEs is available
  • Or when the algorithm has special requirements

Problem setup: 3D stencil calculation of size
240³ run on Lemieux. AMPI runs on any number of PEs
(e.g. 19, 33, 105). Native MPI needs a cube number of PEs.
35
Benefits of Virtualization
  • Software engineering
  • Number of virtual processors can be independently
    controlled
  • Separate VPs for different modules
  • Message driven execution
  • Adaptive overlap of communication
  • Modularity
  • Predictability
  • Automatic out-of-core
  • Asynchronous reductions
  • Dynamic mapping
  • Heterogeneous clusters
  • Vacate, adjust to speed, share
  • Automatic checkpointing
  • Change set of processors used
  • Principle of persistence
  • Enables runtime optimizations
  • Automatic dynamic load balancing
  • Communication optimizations
  • Other runtime optimizations

More info: http://charm.cs.uiuc.edu
36
Data-driven execution
[Figure: two processors, each with its own scheduler and message queue; the scheduler picks the next message from the queue and delivers it to the targeted object]
37
Adaptive Overlap of Communication
  • With virtualization, you get data-driven
    execution
  • There are multiple entities (objects, threads) on
    each processor
  • No single object or thread holds up the
    processor
  • Each one is continued when its data arrives
  • No need to guess which is likely to arrive first
  • So: achieves automatic and adaptive overlap of
    computation and communication
  • This kind of data-driven idea can be used in MPI
    as well.
  • Using wild-card receives
  • But as the program gets more complex, it gets
    harder to keep track of all pending communication
    in all places that are doing a receive

38
Why Message-Driven Modules?
SPMD and Message-Driven Modules (From A. Gursoy,
"Simplified expression of message-driven programs
and quantification of their impact on
performance", Ph.D. Thesis, Apr 1994.)
39
Checkpoint/Restart
  • Any long running application must be able to save
    its state
  • When you checkpoint an application, it uses the
    pup routine to store the state of all objects
  • State information is saved in a directory of your
    choosing
  • Restore also uses pup, so no additional
    application code is needed (pup is all you need)

40
Checkpointing a Job
  • In AMPI, use MPI_Checkpoint(<dir>)
  • Collective call; returns when the checkpoint is
    complete
  • In Charm++, use CkCheckpoint(<dir>, <resume>)
  • Called on one processor; calls <resume> when the
    checkpoint is complete
  • Restarting
  • The charmrun option +restart <dir> is used to
    restart
  • The number of processors need not be the same

41
AMPI's Collective Communication Support
  • Communication operation in which all (or a large
    subset of) processes participate
  • For example: broadcast
  • Often a performance impediment
  • All-to-all communication
  • All-to-all personalized communication (AAPC)
  • All-to-all multicast (AAM)

42
Communication Optimization
Organize processors in a 2D (virtual) mesh
A message from (x1,y1) to (x2,y2) goes via (x1,y2)
2(√P − 1) messages per processor instead of P − 1
But each byte travels twice on the network
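Taking the message count above at face value, a quick worked example of the trade-off (assuming P is a perfect square; the numbers are my own illustration):

    P = 1024:\qquad \text{direct: } P-1 = 1023 \ \text{messages/processor}
    \qquad\longrightarrow\qquad
    \text{2D mesh: } 2(\sqrt{P}-1) = 62 \ \text{messages/processor}

The price is that each byte crosses the network twice, so the scheme pays off when per-message overhead rather than bandwidth dominates, i.e. for all-to-all exchanges of many small messages.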
43
Performance Benchmark
A Mystery?
44
CPU time vs Elapsed Time
  • Time breakdown of an all-to-all operation using
    Mesh library
  • Computation is only a small proportion of the
    elapsed time
  • A number of optimization techniques are developed
    to improve collective communication performance

45
Asynchronous Collectives
  • Time breakdown of 2D FFT benchmark (ms)
  • VPs implemented as threads
  • Overlapping computation with waiting time of
    collective operations
  • Total completion time reduced

46
Shrink/Expand
  • Problem: the availability of the computing platform may
    change
  • Fitting applications onto the platform by object
    migration

Time per step for the million-row CG solver on a
16-node cluster. An additional 16 nodes become available at
step 600.
47
Projections: Performance Analysis Tool
48
Projections
  • Projections is designed for use with a
    virtualized model like Charm++ or AMPI
  • Instrumentation built into runtime system
  • Post-mortem tool with highly detailed traces as
    well as summary formats
  • Java-based visualization tool for presenting
    performance information

49
Trace Generation (Detailed)
  • Link-time option: -tracemode projections
  • In the log mode, each event is recorded in full
    detail (including timestamp) in an internal
    buffer
  • Memory footprint is controlled by limiting the number
    of log entries
  • I/O perturbation can be reduced by increasing the
    number of log entries
  • Generates a <name>.<pe>.log file for each
    processor and a <name>.sts file for the entire
    application
  • Commonly used run-time options
  • +traceroot DIR
  • +logsize NUM

50
Visualization Main Window
51
Post-mortem analysis views
  • Utilization Graph
  • Shows processor utilization against time, and the
    time spent in specific parallel methods
  • Profile: stacked graphs
  • For a given period, a breakdown of the time on each
    processor
  • Includes idle time, and message sending and
    receiving times
  • Timeline
  • Upshot-like, but more detailed
  • Pop-up views of method execution, message arrows,
    user-level events

52
(No Transcript)
53
Projections Views continued
  • Histogram of method execution times
  • How many method-execution instances had a time of
    0-1 ms? 1-2 ms? ..

54
Projections Views continued
  • Overview
  • A fast utilization chart for entire machine
    across the entire time period

55
(No Transcript)
56
Projections: Conclusions
  • Instrumentation built into the runtime
  • Easy to include in a Charm++ or AMPI program
  • Working on
  • Automated analysis
  • Scaling to tens of thousands of processors
  • Integration with hardware performance counters

57
Multi-run analysis in progress
  • Collect performance data from different runs
  • On varying number of processors
  • See which functions increase in computation time
  • Algorithmic overhead
  • See how the communication costs scale up
  • per processor and total

58
Load Balancing
59
Load balancing scenarios
  • Dynamic creation of tasks
  • Initial vs. continuous
  • Coarse-grained vs. fine-grained tasks
  • Master-slave
  • Tree structured
  • Use seed balancers in Charm++/AMPI
  • Iterative computations
  • When there is a strong correlation across
    iterations
  • Measurement-based load balancers
  • When the correlation is weak
  • When there is no correlation, use seed balancers

60
Measurement Based Load Balancing
  • Principle of persistence
  • Object communication patterns and computational
    loads tend to persist over time
  • In spite of dynamic behavior
  • Abrupt but infrequent changes
  • Slow and small changes
  • Runtime instrumentation
  • Measures communication volume and computation
    time
  • Measurement-based load balancers
  • Use the instrumented database periodically to
    make new decisions
  • Many alternative strategies can use the database

61
Periodic Load Balancing Strategies
  • Stop the computation?
  • Centralized strategies
  • The Charm++ RTS collects data (on one processor) about
  • computational load and communication for each
    pair of objects
  • If you are not using AMPI/Charm++, you can do the
    same instrumentation and data collection
  • Partition the graph of objects across processors
  • Take communication into account
  • Pt-to-pt, as well as multicast over a subset
  • As you map an object, add to the load on both
    sending and receiving processor
  • The red communication is free, if it is a
    multicast.

62
Object partitioning strategies
  • You can use graph partitioners like METIS, K-R
  • BUT: the graphs are smaller, and the optimization
    criteria are different
  • Greedy strategies
  • If communication costs are low, use a simple
    greedy strategy (sketched after this list)
  • Sort objects by decreasing load
  • Maintain processors in a heap (by assigned load)
  • In each step, assign the heaviest remaining
    object to the least loaded processor
  • With small-to-moderate communication cost
  • Same strategy, but add communication costs as you
    add an object to a processor
  • Always add a refinement step at the end
  • Swap work from the heaviest-loaded processor to some
    other processor
  • Repeat a few times or until no improvement
  • Refinement-only strategies
63
Object partitioning strategies
  • When communication cost is significant
  • Still use greedy strategy, but
  • At each assignment step, choose between assigning
    O to least loaded processor and the processor
    that already has objects that communicate most
    with O.
  • Based on the degree of difference in the two
    metrics
  • Two-stage assignments
  • In early stages, consider communication costs as
    long as the processors are in the same (broad)
    load class,
  • In later stages, decide based on load
  • Branch-and-bound
  • Searches for optimal, but can be stopped after a
    fixed time

64
Crack Propagation
Decomposition into 16 chunks (left) and 128
chunks, 8 for each PE (right). The middle area
contains cohesive elements. Both decompositions were
obtained using METIS. Pictures: S. Breitenfeld
and P. Geubelle
As computation progresses, crack propagates, and
new elements are added, leading to more complex
computations in some chunks
65
Load balancer in action
Automatic Load Balancing in Crack Propagation
[Figure: 1. elements added; 2. load balancer invoked; 3. chunks migrated]
66
Distributed Load balancing
  • Centralized strategies
  • Still ok for 3000 processors for NAMD
  • Distributed balancing is needed when
  • Number of processors is large and/or
  • load variation is rapid
  • Large machines
  • Need to handle locality of communication
  • Topology sensitive placement
  • Need to work with scant global information
  • Approximate or aggregated global information
    (average/max load)
  • Incomplete global info (only neighborhood)
  • Work diffusion strategies (1980s work by author
    and others!)
  • Achieving global effects by local action

67
Other features
  • Client-Server interface (CCS)
  • Live Visualization support
  • Libraries
  • Communication optimization libraries
  • 2D, 3D FFTs, CG, ..
  • Debugger
  • freeze/thaw

68
Scaling to PetaFLOPS machines: Advice
  • Dice them fine
  • Use a fine-grained decomposition
  • Just enough to amortize the overhead
  • Juggle as much as you can
  • Keep communication ops in flight for latency
    tolerance
  • Avoid synchronizations as much as possible
  • Use asynchronous reductions,
  • and asynchronous collectives in general

69
Grainsize control
  • A simple definition of grainsize:
  • Amount of computation per message
  • Problem: short message / long message
  • More realistic:
  • Computation-to-communication ratio
70
Grainsize Control Wisdom
  • One may think that
  • one should choose the largest grainsize that still
    generates sufficient parallelism
  • In fact
  • one should select the smallest grainsize that will
    amortize the overhead
  • Total CPU time:
  • T = Tseq + (Tseq/g) · Toverhead, where g is the grainsize
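A worked reading of the slide's formula (the overhead fraction ε below is my notation, not from the slides): the total overhead is (Tseq/g)·Toverhead, so keeping it under a fraction ε of the useful work only requires

    \frac{T_{\mathrm{seq}}}{g}\, T_{\mathrm{overhead}} \;\le\; \epsilon\, T_{\mathrm{seq}}
    \quad\Longleftrightarrow\quad
    g \;\ge\; \frac{T_{\mathrm{overhead}}}{\epsilon}

For example, with a per-piece overhead of about 10 µs and a 10% overhead budget, any grainsize of roughly 100 µs or more suffices, independent of the machine size; this is exactly why the smallest grainsize that amortizes the overhead is the right target.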

71
How to avoid Barriers/Reductions
  • Sometimes, they can be eliminated
  • with careful reasoning
  • Somewhat complex programming
  • When they cannot be avoided,
  • one can often render them harmless
  • Use asynchronous reductions (not normal MPI)
  • E.g. in NAMD, energies need to be computed via a
    reduction and output
  • Not used for anything except output
  • Use an asynchronous reduction, working in the
    background
  • When it reports to an object at the root, output
    it

72
Asynchronous reductions: Jacobi
  • Convergence check
  • At the end of each Jacobi iteration, we do a
    convergence check
  • via a scalar reduction (on maxError)
  • But note:
  • each processor can maintain old data for one
    iteration
  • So, use the result of the reduction one iteration
    later! (see the sketch below)
  • The deposit of the reduction is separated from its
    result.
  • MPI_Ireduce(..) returns a handle (like MPI_Irecv)
  • And later, MPI_Wait(handle) will block when you
    need to.
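A hedged sketch of this pattern in MPI terms. The slide names MPI_Ireduce (an AMPI extension at the time); the sketch uses the standard non-blocking MPI_Iallreduce so every rank sees the error, and do_jacobi_sweep / exchange_ghosts are hypothetical helpers standing in for the application code:

    #include <mpi.h>
    #include <cfloat>

    extern double do_jacobi_sweep();   // hypothetical: one local sweep, returns local max error
    extern void exchange_ghosts();     // hypothetical: halo exchange with neighbors

    void jacobi_loop(int maxIters, double tol) {
      double localErr, globalErrPrev = DBL_MAX;
      MPI_Request req = MPI_REQUEST_NULL;
      for (int iter = 0; iter < maxIters; ++iter) {
        localErr = do_jacobi_sweep();
        if (req != MPI_REQUEST_NULL) {            // harvest the reduction started last iteration
          MPI_Wait(&req, MPI_STATUS_IGNORE);
          if (globalErrPrev < tol) break;         // convergence decision lags by one iteration
        }
        MPI_Iallreduce(&localErr, &globalErrPrev, 1, MPI_DOUBLE, MPI_MAX,
                       MPI_COMM_WORLD, &req);     // overlaps with the next sweep
        exchange_ghosts();
      }
      if (req != MPI_REQUEST_NULL) MPI_Wait(&req, MPI_STATUS_IGNORE);
    }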

73
Asynchronous reductions in Jacobi
[Figure: processor timelines. With a synchronous reduction there is a gap between compute phases while waiting for the reduction result; with an asynchronous reduction this gap is avoided and the reduction overlaps the next compute phase]
74
Asynchronous or Split-phase interfaces
  • Notify/wait syncs in CAF

75
Case Studies: Examples of Scalability
  • A series of examples
  • Where we attained scalability
  • What techniques were useful
  • What lessons we learned
  • Molecular Dynamics: NAMD
  • Rocket Simulation

76

Object-Based Parallelization for MD: Force
Decomposition + Spatial Decomposition
  • Now, we have many objects to load balance
  • Each diamond can be assigned to any proc.
  • Number of diamonds (3D):
  • 14 × Number of Patches

77
Bond Forces
  • Multiple types of forces
  • Bonds (2), Angles (3), Dihedrals (4), ..
  • Luckily, each involves atoms in neighboring
    patches only
  • Straightforward implementation:
  • Send a message to all neighbors,
  • receive forces from them
  • 26 × 2 messages per patch!
  • Instead, we do:
  • Send to (7) upstream neighbors
  • Each force is calculated at one patch

78
Virtualized Approach to Implementation using
Charm++
192 144 VPs
700 VPs
30,000 VPs
These 30,000 Virtual Processors (VPs) are
mapped to real processors by the Charm++ runtime system
79
(No Transcript)
80
Case Study: NAMD (Molecular Dynamics)
1.02 TeraFLOPs
"NAMD: Biomolecular Simulation on Thousands of
Processors", J. C. Phillips, G. Zheng, S. Kumar,
and L. V. Kale, Proc. of Supercomputing
2002. Gordon Bell Award. Unprecedented performance
for this application.
ATP synthase
81
Scaling to 64K/128K processors of BG/L
  • What issues will arise?
  • Communication
  • Bandwidth use more important than processor
    overhead
  • Locality
  • Global Synchronizations
  • Costly, but not because it takes longer
  • Rather, small jitters have a large impact
  • Sum of Max vs Max of Sum
  • Load imbalance important, but low grainsize is
    crucial
  • Critical paths gain importance

82
Electronic Structures using CP
  • Car-Parrinello method
  • Based on PINY MD
  • Glenn Martyna, Mark Tuckerman
  • Data structures:
  • A bunch of states (say 128)
  • Represented as
  • 3D arrays of coefficients in G-space, and
  • also 3D arrays in real space
  • Real-space probability density
  • S-matrix: one number for each pair of states
  • For orthonormalization
  • Nuclei
  • Computationally:
  • Transformation from G-space to real space
  • Uses multiple parallel 3D-FFTs
  • Sums up real-space densities
  • Computes energies from the density
  • Computes forces
  • Normalizes the G-space wave functions

83
One Iteration
84
Parallel Implementation
85
(No Transcript)
86
Orthonormalization
  • At the end of every iteration, after updating the
    electron configuration
  • Need to compute (from the states)
  • a correlation matrix S, where S(i,j) depends on the
    entire data of states i and j
  • and its transform T
  • Update the values
  • Computation of S has to be distributed
  • Compute S(i,j,p), where p is the plane number
  • Sum over p to get S(i,j)
  • The actual conversion from S to T is sequential
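In formula form (my notation, summarizing the two bullets above; the per-plane decomposition is what makes the computation of S distributable, with the sum over p done as a reduction):

    S_{i,j} \;=\; \sum_{p} S_{i,j,p}, \qquad p = \text{plane index of the state arrays}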

87
Orthonormalization
88
Computation/Communication Overlap
89
[Figure: chare arrays in the parallel structure: G-space planes (integration, 1D-FFT), real-space planes (2D-FFT), rho (density) real-space planes, computation of forces on/by nuclei, real-space planes (2D-IFFT), G-space planes (integration, 1D-IFFT), and pair calculators]
90
(No Transcript)
91
(No Transcript)
92
(No Transcript)
93
(No Transcript)
94
Rocket Simulation
  • Dynamic, coupled physics simulation in 3D
  • Finite-element solids on unstructured tet mesh
  • Finite-volume fluids on structured hex mesh
  • Coupling every timestep via a least-squares data
    transfer
  • Challenges
  • Multiple modules
  • Dynamic behavior: burning surface, mesh adaptation

Robert Fiedler, Center for Simulation of Advanced
Rockets
Collaboration with M. Heath, P. Geubelle, and others
95
Application Example GEN2
96
Rocket simulation via virtual processors
  • Scalability challenges
  • Multiple independently developed modules,
  • possibly executing concurrently
  • Evolving simulation
  • Changes the balance between fluid and solid
  • Adaptive refinements
  • Dynamic insertion of sub-scale simulation
    components
  • Crack-driven fluid flow and combustion
  • Heterogeneous (speed-wise) clusters

97
Rocket simulation via virtual processors
98
AMPI and Roc* Communication
  • By separating independent modules into separate
    sets of virtual processors, flexibility was
    gained to deal with alternate formulations
  • Fluids and solids executing concurrently OR one
    after the other
  • Changes in the pattern of load distribution within or
    across modules

[Figure: multiple Rocflo virtual processors]
99
Using a Simulator
100
Performance Prediction on Large Machines
  • Solution
  • Leverage virtualization
  • Develop a machine emulator
  • Simulator: accurate time modeling
  • Run a program on 100,000 processors using only
    hundreds of processors
  • Problem
  • How to develop a parallel application for a
    non-existent machine?
  • How to predict the performance of applications on
    future machines?
  • How to do performance tuning without continuous
    access to a large machine?

Originally targeted to Blue Gene/Cyclops; now
generalized (and used for BG/L)
101
Why Emulate?
  • Allow development of parallel software
  • Exposes scalability limitations in data
    structures, e.g.
  • O(P) arrays are ok if P is not a million
  • Software is ready before the machine is

102
How to emulate 1M-processor apps
  • Leverage processor virtualization
  • Let each virtual processor of Charm++ stand for a
    real processor of the emulated machine
  • Adequate if you want to emulate MPI apps on 1M
    processors!

103
Emulation on a Parallel Machine
Simulated multi-processor nodes
Simulating (Host) Processors
Emulating 8M threads on 96 ASCI-Red processors
104
How to emulate 1M-processor apps
  • A twist: what if you want to emulate a Charm++ app?
  • E.g. 8M objects (VPs) using 1M target-machine
    processors?
  • A little runtime trickery:
  • Processors are modeled as data structures, while VPs
    remain VPs!

[Figure: VPs attached to emulated processors, which are themselves data structures on the host]
105
Memory Limit?
  • Some applications have low memory use
  • Molecular dynamics
  • Some large machines may have low
    memory per processor
  • E.g. BG/L: 256 MB per 2-processor node
  • A BG/C design: 16-32 MB per 32-processor node
  • A more general solution is still needed
  • Provided by the out-of-core execution capability of
    Charm++

106
Message-Driven Execution and Out-of-Core Execution
Virtualization leads to message-driven execution.
So, we can prefetch data accurately: automatic
out-of-core execution.
107
A Success Story
  • Emulation-based implementation of the lower layers,
    as well as Charm++ and AMPI, was completed last year
  • As a result,
  • the BG/L port of Charm++/AMPI was accomplished in 1-2
    days
  • Actually, 1-2 hours for the basic port
  • 1-2 days to fix an OS-level bug that prevented
    user-level multi-threading

108
Emulator Performance
  • Scalable
  • Emulating a real-world MD application on a 200K-
    processor BG machine

Gengbin Zheng, Arun Singla, Joshua Unger,
Laxmikant V. Kalé, "A Parallel-Object
Programming Model for PetaFLOPS Machines and Blue
Gene/Cyclops", NGS Program Workshop, IPDPS 2002
109
Performance Prediction
  • How to predict component performance?
  • Multiple resolution levels
  • Sequential component:
  • user-supplied expressions, timers, performance
    counters, or instruction-level simulation
  • Communication component:
  • simple latency-based network model, or
    contention-based network simulation
  • Parallel Discrete Event Simulation (PDES)
  • Each logical processor (LP) has a virtual clock
  • Events are time-stamped
  • The state of an LP changes when an event arrives at
    it
  • Protocols: conservative vs. optimistic
  • Conservative (examples: DaSSF, MPI-SIM)
  • Optimistic (examples: Time Warp, SPEEDES)

110
Why not use existing PDES?
  • Major synchronization overheads
  • Checkpointing overhead
  • Rollback overhead
  • We can do better
  • Inherent determinacy of parallel application
  • Most parallel programs are written to be
    deterministic,

111
Categories of Applications
  • Linear-order applications
  • No wildcard receives
  • Strong determinacy, no timestamp correction
    necessary
  • Reactive applications (atomic)
  • Message-driven objects
  • Methods execute as the corresponding messages arrive
  • Multi-dependent applications
  • Irecvs with WaitAll
  • Use of Structured Dagger to capture dependencies

Gengbin Zheng, Gunavardhan Kakulapati, Laxmikant
V. Kalé, "BigSim: A Parallel Simulator for
Performance Prediction of Extremely Large
Parallel Machines", IPDPS 2004
112
Architecture of BigSim Simulator
[Figure: Charm++ and MPI applications run on the BigSim Emulator over the Charm++ runtime (with its load balancing module); sequential timing comes from performance counters, a simple network model, or instruction simulators (RSim, IBM, ..); an online PDES engine produces simulation output trace logs, visualized with Projections]
113
Architecture of BigSim Simulator
[Figure: the same architecture, extended with an offline PDES stage: the trace logs feed BigNetSim (built on POSE), a network simulator, whose output is again visualized with Projections]
114
Big Network Simulation
  • Simulate network behavior: packetization,
    routing, contention, etc.
  • Incorporated with post-mortem timestamp correction
    via POSE
  • Currently models torus (BG/L) and fat-tree (QsNet)
    networks

115
BigSim Validation on Lemieux
116
Performance of the BigSim simulator
[Plot: speedup vs. number of real processors (PSC Lemieux)]
117
FEM simulation
  • Simple 2D structural simulation in AMPI
  • 5 million element mesh
  • 16k BG processors
  • Running on only 32 PSC Lemieux processors

118
Case Study: LeanMD
  • Molecular dynamics simulation designed for large
    machines
  • k-away cut-off parallelization
  • Benchmark: ER-GRE with 3-away
  • 36,573 atoms
  • 1.6 million objects vs. 6,000 in 1-away
  • 8-step simulation
  • 32k-processor BG machine
  • Running on 400 PSC Lemieux processors
Performance visualization tools
119
Load Imbalance
Histogram
120
Performance visualization
121
Domain Specific Frameworks
122
Component Frameworks
  • Motivation
  • Reduce tedium of parallel programming for
    commonly used paradigms
  • Encapsulate required parallel data structures and
    algorithms
  • Provide easy to use interface,
  • Sequential programming style preserved
  • No alienating invasive constructs
  • Use adaptive load balancing framework
  • Component frameworks
  • FEM
  • Multiblock
  • AMR

123
FEM framework
  • Presents a clean, almost-serial interface
  • Hides the parallel implementation in the runtime
    system
  • Leaves physics and time integration to the user
  • Users write code similar to sequential code
  • Or easily modify sequential code
  • Input:
  • connectivity file (mesh), boundary data and
    initial data
  • Framework:
  • Partitions the data, and
  • Starts the driver for each chunk in a separate thread
  • Automates communication, once the user registers
    fields to be communicated
  • Automatic dynamic load balancing

124
Why use the FEM Framework?
  • Makes parallelizing a serial code faster and
    easier
  • Handles mesh partitioning
  • Handles communication
  • Handles load balancing (via Charm++)
  • Allows extra features
  • IFEM Matrix Library
  • NetFEM Visualizer
  • Collision Detection Library

125
Serial FEM Mesh
126
Partitioned Mesh
127
FEM Mesh: Node Communication
  • Summing forces from other processors only takes
    one call:
  • FEM_Update_field (see the sketch below)
  • A similar call updates ghost regions
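A hedged sketch of how a per-chunk driver might use this call. Only FEM_Update_field is taken from the slide; the header name, the zero placeholders, and the commented steps are assumptions standing in for framework setup and the user's physics code:

    #include "fem.h"       // FEM framework header (assumed name)
    #include <vector>

    extern "C" void driver(void) {   // the framework starts one driver per mesh chunk
      int nnodes = 0;                // placeholder: obtained from the framework's mesh queries
      int force_fid = 0;             // placeholder: a registered 3-doubles-per-node field id
      std::vector<double> nodeForce(3 * nnodes, 0.0);
      for (int step = 0; step < 100; ++step) {
        // 1. purely local: accumulate each element's contribution into nodeForce
        // 2. one call sums contributions to shared nodes across chunk boundaries:
        FEM_Update_field(force_fid, nodeForce.data());
        // 3. user physics: integrate nodal quantities using the summed forces
      }
    }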

128
FEM Framework Users: CSAR
  • Rocflu fluids solver, a part of GENx
  • Finite-volume fluid dynamics code
  • Uses FEM ghost elements
  • Author: Andreas Haselbacher

Robert Fiedler, Center for Simulation of Advanced
Rockets
129
FEM Experience
  • Previous:
  • 3-D volumetric/cohesive crack propagation code
  • (P. Geubelle, S. Breitenfeld, et al.)
  • 3-D dendritic growth fluid solidification code
  • (J. Dantzig, J. Jeong)
  • Adaptive insertion of cohesive elements
  • Mario Zaczek, Philippe Geubelle
  • Performance data
  • Multi-grain contact (in progress)
  • Spandan Maiti, S. Breitenfeld, O. Lawlor, P.
    Geubelle
  • Using the FEM framework and collision detection
  • NSF-funded project
  • Space-time meshes
  • Did the initial parallelization in 4 days

130
Performance data: ASCI Red
Mesh with 3.1 million elements
Speedup of 1155 on 1024 processors.
131
Dendritic Growth
  • Studies evolution of solidification
    microstructures using a phase-field model
    computed on an adaptive finite element grid
  • Adaptive refinement and coarsening of grid
    involves re-partitioning

Jon Dantzig et al., with O. Lawlor and others from
PPL
132
Overhead of Multipartitioning
Conclusion: overhead of virtualization is small,
and in fact it benefits by creating automatic
133
Parallel Collision Detection
  • Detect collisions (intersections) between objects
    scattered across processors
  • Approach, based on Charm++ arrays
  • Overlay regular, sparse 3D grid of voxels (boxes)
  • Send objects to all voxels they touch
  • Collide objects within each voxel independently
    and collect results
  • Leave collision response to user code
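A minimal sketch (not the library's code) of the core indexing step described above: computing which voxels of a regular grid an object's bounding box touches, so the object can be sent to each of them. The BBox type and voxel cell size are illustrative assumptions:

    #include <cmath>
    #include <vector>

    struct BBox { double lo[3], hi[3]; };          // axis-aligned bounding box of one object
    struct Voxel { int x, y, z; };                 // index of one grid cell

    // Return the (sparse) set of voxels a bounding box overlaps, for a grid of given cell size.
    std::vector<Voxel> voxelsTouched(const BBox &b, double cell) {
      int lo[3], hi[3];
      for (int d = 0; d < 3; ++d) {
        lo[d] = (int)std::floor(b.lo[d] / cell);
        hi[d] = (int)std::floor(b.hi[d] / cell);
      }
      std::vector<Voxel> out;
      for (int x = lo[0]; x <= hi[0]; ++x)
        for (int y = lo[1]; y <= hi[1]; ++y)
          for (int z = lo[2]; z <= hi[2]; ++z)
            out.push_back({x, y, z});
      return out;
    }

    // In the parallel version, each object is then sent to the chare array element indexed
    // by each Voxel; elements are created on demand, and objects within one voxel are
    // collided locally, with collision response left to user code.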

134
Parallel Collision Detection
  • Results: 2 µs per polygon
  • Good speedups to 1000s of processors

ASCI Red, 65,000 polygons per processor (scaled
problem), up to 100 million polygons
  • This was a significant improvement over the
    state of the art.
  • Made possible by virtualization, and
  • Asynchronous, as-needed creation of voxels
  • Localization of communication: the voxel is often on the
    same processor as the contributing polygon

135
Summary
  • Processor virtualization is a powerful technique
  • Charm++/AMPI are production-quality systems
  • with bells and whistles
  • Can scale to petaFLOPS-class machines
  • Domain-specific frameworks
  • Can raise the level of abstraction and promote
    reuse
  • Unstructured mesh framework
  • Next: compiler support, new coordination
    mechanisms
  • Software available:
  • http://charm.cs.uiuc.edu

136
Optimizing for Communication Patterns
  • The parallel-objects Runtime System can observe,
    instrument, and measure communication patterns
  • Communication is from/to objects, not processors
  • Load balancers can use this to optimize object
    placement
  • Communication libraries can optimize
  • By substituting the most suitable algorithm for each
    operation
  • Learning at runtime

V. Krishnan, MS Thesis, 1996
137
Molecular Dynamics: Benefits of avoiding barriers
  • In NAMD:
  • The energy reductions were made asynchronous
  • No other global barriers are used in cut-off
    simulations
  • This came in handy when
  • Running on Pittsburgh's Lemieux (3000 processors)
  • The machine (or our way of using the communication
    layer) produced unpredictable, random delays in
    communication
  • A send call would remain stuck for 20 ms, for
    example
  • How did the system handle it?
  • See timeline plots

138
Golden Rule of Load Balancing
Fallacy: the objective of load balancing is to
minimize variance in load across processors.
Example: 50,000 tasks of equal size, 500
processors: (A) All processors get 99 tasks,
except the last 5, which get 100 + 99 = 199 each;
OR (B) All processors get 101, except the last 5,
which get 1 each.
Identical variance, but situation A is much worse!
Golden Rule: it is OK if a few processors idle,
but avoid having processors that are overloaded
with work.
Finish time = max over processors of the time on that
processor (excepting data dependence and communication
overhead issues)
139
Amdahl's Law and grainsize
  • Before we get to load balancing:
  • Original law:
  • If a program has a K% sequential section, then its
    speedup is limited to 100/K,
  • even if the rest of the program is parallelized
    completely
  • Grainsize corollary:
  • If any individual piece of work takes > K time
    units, and the sequential program takes Tseq,
  • speedup is limited to Tseq / K
  • So:
  • Examine performance data via histograms to find
    the sizes of remappable work units
  • If some are too big, change the decomposition
    method to make smaller units
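A short worked reading of the corollary (my notation): since the parallel time can never be shorter than the largest indivisible piece,

    \mathrm{Speedup} \;=\; \frac{T_{\mathrm{seq}}}{T_{\mathrm{parallel}}}
    \;\le\; \frac{T_{\mathrm{seq}}}{\max_i g_i}

where the g_i are the sizes of the remappable pieces. For example, if Tseq = 100 s and one piece takes 1 s, the speedup is capped at 100 no matter how many processors are used; that is exactly what the histogram views are meant to catch.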

140
Grainsize: LeanMD for Blue Gene/L
  • BG/L is a planned IBM machine with 128k
    processors
  • Here, we need even more objects:
  • Generalize the hybrid decomposition scheme
  • from 1-away to k-away

2-away: cubes are half the size.
141
[Figure: decompositions with 76,000 VPs, 5,000 VPs, and 256,000 VPs]
142
New strategy