1
Adaptive MPI: Intelligent runtime strategies and
performance prediction via simulation
  • Laxmikant Kale
  • http://charm.cs.uiuc.edu
  • Parallel Programming Laboratory
  • Department of Computer Science
  • University of Illinois at Urbana-Champaign

2
PPL Mission and Approach
  • To enhance performance and productivity in
    programming complex parallel applications
  • Performance: scalable to thousands of processors
  • Productivity: of human programmers
  • Complex: irregular structure, dynamic variations
  • Approach: application-oriented yet CS-centered
    research
  • Develop enabling technology for a wide
    collection of apps
  • Develop, use, and test it in the context of real
    applications
  • How?
  • Develop novel parallel programming techniques
  • Embody them into easy-to-use abstractions
  • So application scientists can use advanced
    techniques with ease
  • Enabling technology reused across many apps

3
Develop abstractions in context of full-scale
applications
Protein Folding
Quantum Chemistry (QM/MM)
Molecular Dynamics
Computational Cosmology
Parallel Objects, Adaptive Runtime System
Libraries and Tools
Crack Propagation
Space-time meshes
Dendritic Growth
Rocket Simulation
The enabling CS technology of parallel objects
and intelligent Runtime systems has led to
several collaborative applications in CSE
4
Migratable Objects (aka Processor Virtualization)
Programmer: over-decomposition into virtual
processors. Runtime: assigns VPs to processors.
Enables adaptive runtime strategies.
Implementations: Charm++, AMPI
5
Outline
  • Adaptive MPI
  • Load balancing
  • Fault tolerance
  • Projections: performance analysis
  • Performance prediction: BigSim

6
AMPI: MPI with Virtualization
  • Each virtual process is implemented as a user-level
    thread embedded in a Charm++ object

7
Making AMPI work
  • Multiple user-level threads per processor
  • Problem: global variables
  • Solution 1: automatically switch the GOT pointer at
    context switch
  • Available on most machines
  • Solution 2: manually replace global variables
    (see the sketch after this list)
  • Solution 3: automatic, via compiler support
    (AMPIzer)
  • Migrating stacks
  • Use the isomalloc technique (Mehaut et al.)
  • Use memory files and mmap()
  • Heap data
  • Isomalloc heaps
  • OR user-supplied pack/unpack functions for the
    heap data
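To make Solution 2 concrete, here is a minimal sketch of manual global-variable privatization. The struct and function names are illustrative assumptions, not part of AMPI; the point is simply that process-wide globals become per-thread state passed explicitly.

// Sketch of manual privatization (Solution 2). Names are illustrative,
// not part of the AMPI API. Each MPI "process" (a user-level thread under
// AMPI) carries its own state instead of sharing a process-wide global.

#include <mpi.h>

// Before: int iteration_count;   // a global, unsafe with many threads per PE

// After: per-rank state, passed explicitly to every routine that needs it.
struct RankState {
  int iteration_count;
  double residual;
};

static void do_timestep(RankState *st) {
  st->iteration_count++;          // private to this virtual processor
  // ... computation and MPI calls that use st ...
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  RankState st = {0, 0.0};        // lives on this thread's migratable stack
  for (int i = 0; i < 100; i++)
    do_timestep(&st);
  MPI_Finalize();
  return 0;
}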

8
ELF and global variables
  • The Executable and Linking Format (ELF)
  • The executable has a Global Offset Table (GOT)
    containing global data
  • The GOT pointer is stored in the ebx register
  • Switch this pointer when switching between
    threads
  • Supported on Linux, Solaris 2.x, and more
  • Integrated in Charm++/AMPI
  • Invoked by the compile-time option -swapglobal

9
Adaptive overlap and modules
SPMD and Message-Driven Modules (from A. Gursoy,
Simplified Expression of Message-Driven Programs
and Quantification of Their Impact on Performance,
Ph.D. thesis, Apr. 1994.)
Modularity, Reuse, and Efficiency with
Message-Driven Libraries, Proc. of the Seventh
SIAM Conference on Parallel Processing for
Scientific Computing, San Francisco, 1995
10
Benefit of Adaptive Overlap
Problem setup: 3D stencil calculation of size
240³ run on Lemieux. Shows AMPI with
virtualization ratios of 1 and 8.
11
Comparison with Native MPI
  • Performance
  • Slightly worse without optimization
  • Being improved
  • Flexibility
  • Small number of PEs available
  • Special requirements of the algorithm

Problem setup: 3D stencil calculation of size
240³ run on Lemieux. AMPI runs on any number of PEs
(e.g. 19, 33, 105). Native MPI needs a cube number.
12
AMPI Extensions
  • Automatic load balancing
  • Non-blocking collectives
  • Checkpoint/restart mechanism
  • Multi-module programming

13
Load Balancing in AMPI
  • Automatic load balancing: MPI_Migrate()
  • A collective call informing the load balancer that
    the thread is ready to be migrated, if needed
    (a usage sketch follows below).
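A minimal usage sketch of the call above. The timestep loop, the work stub, and the migration period are illustrative assumptions; only MPI_Migrate() itself is the AMPI extension named on this slide.

// Sketch: invoking MPI_Migrate() at iteration boundaries so the AMPI
// runtime may rebalance load. Loop structure and period are assumptions.

#include <mpi.h>

static void do_timestep(int step) {
  (void)step;
  // ... application computation and communication for one step ...
}

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  const int nsteps = 1000, lb_period = 50;
  for (int step = 1; step <= nsteps; step++) {
    do_timestep(step);
    if (step % lb_period == 0)
      MPI_Migrate();   // collective: this thread is ready to migrate if needed
  }
  MPI_Finalize();
  return 0;
}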

14
Load Balancing Steps
[Diagram: regular timesteps; instrumented
timesteps; detailed, aggressive load balancing
(object migration); refinement load balancing]
15
Processor utilization against time on 128 and
1024 processors. On 128 processors, a single load
balancing step suffices, but on 1024 processors
a refinement step is needed.
16
Shrink/Expand
  • Problem: availability of the computing platform may
    change
  • Fitting applications to the platform by object
    migration

Time per step for the million-row CG solver on a
16-node cluster. An additional 16 nodes become
available at step 600.
17
Optimized All-to-all Surprise
Completion time vs. computation overhead: for a
76-byte all-to-all on Lemieux, the CPU is free
during most of the time taken by the collective
operation. This observation led to the development
of asynchronous collectives, now supported in AMPI.
[Chart: AAPC completion time (ms) vs. message size
(100B to 8KB) for Mesh and Direct strategies]
18
Asynchronous Collectives
  • Our implementation is asynchronous:
  • The collective operation is posted
  • Test/wait for its completion
  • Meanwhile, useful computation can utilize the CPU
    (a complete sketch follows below)
  • MPI_Ialltoall(..., req)
  • /* other computation */
  • MPI_Wait(req)
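A self-contained sketch of the overlap pattern above, written against the standard MPI-3 nonblocking all-to-all interface (AMPI's original extension had the same flavor). Buffer sizes and the placeholder computation are illustrative assumptions.

// Sketch: overlap independent computation with an asynchronous all-to-all.
// Buffer contents/sizes and the placeholder work are assumptions.

#include <mpi.h>
#include <vector>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int nranks;
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  std::vector<double> sendbuf(nranks, 1.0), recvbuf(nranks);
  MPI_Request req;

  // Post the collective; it progresses while we keep computing.
  MPI_Ialltoall(sendbuf.data(), 1, MPI_DOUBLE,
                recvbuf.data(), 1, MPI_DOUBLE,
                MPI_COMM_WORLD, &req);

  double local = 0.0;
  for (int i = 0; i < 1000000; i++)     // other, independent computation
    local += i * 1e-6;

  MPI_Wait(&req, MPI_STATUS_IGNORE);    // exchanged data is now usable
  MPI_Finalize();
  return (local >= 0.0) ? 0 : 1;
}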

19
Fault Tolerance
20
Motivation
  • Applications need fast, low-cost, and scalable
    fault tolerance support
  • As machines grow in size
  • MTBF decreases
  • Applications have to tolerate faults
  • Our research:
  • Disk-based checkpoint/restart
  • In-memory double checkpointing/restart
  • Sender-based message logging
  • Proactive response to fault prediction
    (impending fault response)

21
Checkpoint/Restart Mechanism
  • Automatic checkpointing for AMPI and Charm++
  • Migrate objects to disk!
  • Automatic fault detection and restart
  • Now available in the distribution versions of AMPI
    and Charm++
  • Blocking, coordinated checkpoint
  • States of chares are checkpointed to disk
  • Collective call: MPI_Checkpoint(DIRNAME)
    (a usage sketch follows below)
  • The entire job is restarted
  • Virtualization allows restarting on a different
    number of processors
  • Runtime option:
  • > ./charmrun pgm +p4 +vp16 +restart DIRNAME
  • Simple but effective for common cases
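A minimal sketch of how the collective checkpoint call might be placed at iteration boundaries. The loop, checkpoint period, and directory name are illustrative assumptions; MPI_Checkpoint(DIRNAME) is the AMPI call named on this slide, and the in-memory MPI_MemCheckpoint() of the next slide would be used analogously.

// Sketch: periodic coordinated checkpointing via AMPI's MPI_Checkpoint.
// Timestep loop, period, and directory name are assumptions.

#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  const int nsteps = 1000, ckpt_period = 100;
  for (int step = 1; step <= nsteps; step++) {
    // ... application timestep ...
    if (step % ckpt_period == 0)
      MPI_Checkpoint("ckpt_dir");   // collective: all threads checkpoint to disk
  }
  MPI_Finalize();
  return 0;
}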

22
In-memory Double Checkpoint
  • In-memory checkpoint
  • Faster than disk
  • Coordinated checkpoint
  • Simple: MPI_MemCheckpoint(void)
  • User can decide what makes up useful state
  • Double checkpointing
  • Each object maintains 2 checkpoints:
  • on the local physical processor
  • on a remote "buddy" processor
  • For jobs with large memory
  • Use local disks!

32 processors with 1.5GB memory each
23
Restart
  • A dummy process is created:
  • Need not have application data or checkpoints
  • Necessary for the runtime
  • Starts recovery on all other processors
  • Other processors:
  • Remove all chares
  • Restore checkpoints lost on the crashed PE
  • Restore chares from local checkpoints
  • Load balance after restart

24
Restart Performance
  • 10 crashes
  • 128 processors
  • Checkpoint every 10 time steps

25
Scalable Fault Tolerance
Motivation: when one processor out of 100,000
fails, the other 99,999 shouldn't have to roll back
to their checkpoints!
  • How?
  • Sender-side message logging
  • Asynchronous checkpoints on buddy processors
  • Latency tolerance mitigates costs
  • Restart can be sped up by spreading out
    objects from the failed processor
  • Current progress:
  • Basic scheme implemented and tested in
    simple programs
  • General-purpose implementation in progress

Only the failed processor's objects recover from
checkpoints, replaying their messages, while
others continue.
26
Recovery Performance
Execution Time with increasing number of faults
on 8 processors (Checkpoint period 30s)
27
Projections: Performance visualization and
analysis tool
28
An Introduction to Projections
  • Performance analysis tool for Charm++-based
    applications.
  • Automatic trace instrumentation.
  • Post-mortem visualization.
  • Multiple advanced features that support:
  • Data volume control
  • Generation of additional user data.

29
Trace Generation
  • Automatic instrumentation by the runtime system
  • Detailed:
  • In log mode, each event is recorded in full
    detail (including timestamp) in an internal
    buffer.
  • Summary:
  • Reduces the size of output files and memory
    overhead.
  • Produces a few lines of output data per
    processor.
  • This data is recorded in bins corresponding to
    intervals of size 1 ms by default.
  • Flexible:
  • APIs and runtime options for instrumenting user
    events and controlling data generation.

30
The Summary View
  • Provides a view of the overall utilization of
    the application.
  • Very quick to load.

31
Graph View
  • Features:
  • Selectively view entry points.
  • Convenient means to switch between axis data
    types.

32
Timeline
  • The most detailed view in Projections.
  • Useful for understanding critical path issues or
    unusual entry point behaviors at specific times.

33
Animations
34
(No Transcript)
35
(No Transcript)
36
(No Transcript)
37
Time Profile
  • Identified a portion of CPAIMD (Quantum Chemistry
    code) that ran too early via the Time Profile
    tool.

Solved by prioritizing entry methods
38
(No Transcript)
39
(No Transcript)
40
Overview: one line for each processor, time on the
X-axis. White: busy, black: idle, red: intermediate.
41
A boring but good-performance overview
42
An interesting but pathetic overview
43
Stretch Removal
Histogram views: number of function executions vs.
their granularity (note the log scale on the Y-axis).
Before optimizations: over 16 large stretched
calls. After optimizations: about 5 large stretched
calls, the largest of them much smaller, and almost
all calls taking less than 3.2 ms.
44
Miscellaneous Features -Color Selection
  • Colors are automatically supplied by default.
  • We allow users to select their own colors and
    save them.
  • These colors can then be restored the next time
    Projections loads.

45
User APIs
  • Controlling trace generation:
  • void traceBegin()
  • void traceEnd()
  • Tracing user events (see the sketch after this list):
  • int traceRegisterUserEvent(char *, int)
  • void traceUserEvent(char *)
  • void traceUserBracketEvent(int, double, double)
  • double CmiWallTimer()
  • Runtime options:
  • +traceoff
  • +traceroot <directory>
  • Projections mode only:
  • +logsize <# entries>
  • +gz-trace
  • Summary mode only:
  • +bincount <# of intervals>
  • +binsize <interval time quanta (us)>
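A minimal sketch of bracketing a user-defined event with the tracing calls listed above. The event name, event id, the timed work, and the include are illustrative assumptions; the trace functions themselves are the APIs on this slide.

// Sketch: recording a user-defined bracketed event for Projections.
// Event name/id and the timed work are assumptions.

#include "charm++.h"   // assumed include for the trace APIs and CmiWallTimer

void timed_phase() {
  // Register once; the id ties subsequent bracket events to this name.
  static int ev = traceRegisterUserEvent((char *)"solver_phase", 1234);

  double t0 = CmiWallTimer();
  // ... the work we want to see as a bracketed event in the timeline ...
  double t1 = CmiWallTimer();

  traceUserBracketEvent(ev, t0, t1);   // record the [t0, t1] interval
}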

46
Performance Prediction Via Parallel Simulation
47
BigSim Performance Prediction
  • Extremely large parallel machines are already
    here or about to become available:
  • ASCI Purple (12k processors, 100 TF)
  • BlueGene/L (64k processors, 360 TF)
  • BlueGene/C (1M processors, 1 PF)
  • How does one write a petascale application?
  • What will the performance be like?
  • Would existing parallel applications scale?
  • The machines are not there yet
  • Parallel performance is hard to model without
    actually running the program

48
Objectives and Simulation Model
  • Objectives:
  • Develop techniques to facilitate the development
    of efficient petascale applications
  • Based on performance prediction of applications
    on large simulated parallel machines
  • Simulation-based performance prediction:
  • Focus on Charm++ and AMPI programming models;
    performance prediction based on PDES
  • Supports varying levels of fidelity:
  • processor prediction, network prediction
  • Modes of execution:
  • online and post-mortem

49
Blue Gene Emulator/Simulator
  • Actually BigSim, for simulation of any large
    machine using smaller parallel machines
  • Emulator:
  • Allows development of the programming environment
    and algorithms before the machine is built
  • Allowed us to port Charm++ to the real BG/L in 1-2
    days
  • Simulator:
  • Allows performance analysis of programs running
    on large machines, without access to the large
    machines
  • Uses parallel discrete event simulation

50
Architecture of BigNetSim
51
Simulation Details
  • Emulate large parallel machines on smaller
    existing parallel machines: run a program with
    multi-million-way parallelism (implemented using
    user-level threads)
  • Consider memory and stack-size limitations
  • Ensure time-stamp correction
  • The emulator layer API is built on top of the
    machine layer
  • Charm++/AMPI are implemented on top of the emulator,
    like any other machine layer
  • The emulator layer supports all Charm++ features:
  • Load balancing
  • Communication optimizations

52
Performance Prediction
  • Usefulness of performance prediction:
  • For the application developer (making small
    modifications):
  • Difficult to get runtimes on huge current
    machines
  • For future machines, simulation is the only
    possibility
  • The performance debugging cycle can be considerably
    shortened
  • Even approximate predictions can identify
    performance issues such as load imbalance, serial
    bottlenecks, communication bottlenecks, etc.
  • For the machine architecture designer:
  • Knowing how target applications behave on the
    design can help identify problems with the machine
    design early
  • Record traces during parallel emulation
  • Run trace-driven simulation (PDES)

53
Performance Prediction (contd.)
  • Predicting the time of sequential code:
  • User-supplied time for every code block
  • Wall-clock measurements on the simulating machine can
    be used via a suitable multiplier (see the worked
    example after this list)
  • Hardware performance counters to count floating
    point, integer, branch instructions, etc.
  • Cache performance and memory footprint are
    approximated by the percentage of memory accesses and
    the cache hit/miss ratio
  • Instruction-level simulation (not implemented)
  • Predicting network performance:
  • No contention: time based on topology and other
    network parameters
  • Back-patching: modifies communication time using the
    amount of communication activity
  • Network simulation: modeling the network entirely
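As a concrete illustration of the multiplier approach above (the numbers are illustrative, not from the slides): if a code block takes 4.0 ms of wall-clock time on the simulating host, and the target processor is estimated to be 2.5x faster for that kind of code, then the predicted time charged for that block on the target is 4.0 ms / 2.5 = 1.6 ms.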

54
Performance Prediction Validation
  • 7-point stencil program with 3D decomposition
  • Run on 32 real processors, simulating 64, 128, ...
    PEs
  • NAMD benchmark: Apolipoprotein-A1 dataset
    with 92k atoms, running for 15 timesteps
  • For large processor counts, because of cache and
    memory effects, the predicted value seems to diverge
    from the actual value

55
Performance on Large Machines
  • Problem:
  • How to predict the performance of applications on
    future machines? (e.g. BG/L)
  • How to do performance tuning without continuous
    access to a large machine?
  • Solution:
  • Leverage virtualization
  • Develop a machine emulator
  • Simulator: accurate time modeling
  • Run a program on 100,000 processors using only
    hundreds of processors
  • Analysis:
  • Use the performance visualization suite (Projections)

Molecular dynamics benchmark ER-GRE: 36,573
atoms, 1.6 million objects, 8-step simulation, 16k
BG processors
56
Projections: performance visualization
57
Network Simulation
  • Detailed implementation of interconnection
    networks
  • Configurable network parameters:
  • Topology / routing
  • Input / output VC selection
  • Bandwidth / latency
  • NIC parameters
  • Buffer / message size, etc.
  • Support for hardware collectives in the network layer

58
Higher level programming
  • Orchestration language:
  • Allows expressing global control flow in a Charm++
    program
  • HPF-like flavor, but with Charm++-like processor
    virtualization and explicit communication
  • Multiphase Shared Arrays (MSA):
  • Provide a disciplined use of shared address
    space
  • Each array can be accessed only in one of the
    following modes:
  • read-only, write-by-one-thread, accumulate-only
  • The access mode can change from phase to phase
  • Phases are delineated by a per-array sync
    (see the conceptual sketch after this list)
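A conceptual sketch of the phase discipline just described. The SharedArray class and its methods are hypothetical illustrations, not the actual MSA API; they only show how a single access mode is in force between per-array syncs.

// Conceptual sketch of MSA-style phases. SharedArray and its methods are
// hypothetical, not the real MSA API; asserts stand in for the discipline
// the runtime would enforce.

#include <vector>
#include <cassert>

enum class Mode { ReadOnly, WriteByOne, AccumulateOnly };

class SharedArray {
  std::vector<double> data;
  Mode mode = Mode::ReadOnly;
public:
  explicit SharedArray(int n) : data(n, 0.0) {}
  void sync(Mode next) { mode = next; }   // phase boundary: change access mode
  double read(int i) const { assert(mode == Mode::ReadOnly); return data[i]; }
  void accumulate(int i, double v) { assert(mode == Mode::AccumulateOnly); data[i] += v; }
};

int main() {
  SharedArray a(100);

  a.sync(Mode::AccumulateOnly);   // phase 1: threads may only accumulate
  a.accumulate(7, 3.5);

  a.sync(Mode::ReadOnly);         // phase 2: threads may only read
  return a.read(7) > 0.0 ? 0 : 1;
}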

59
Other projects
  • Faucets:
  • Flexible cluster scheduler
  • Resource management across clusters
  • Multi-cluster applications
  • Load balancing strategies
  • Communication optimization
  • POSE: parallel discrete event simulation
  • ParFUM:
  • Parallel framework for unstructured meshes
  • We invite collaborations on:
  • Virtualization of other languages and libraries
  • New load balancing strategies
  • Applications

60
Some Active Collaborations
  • Biophysics: molecular dynamics (NIH, ..)
  • Long-standing (since 1991), Klaus Schulten, Bob Skeel
  • Gordon Bell award in 2002
  • Production program used by biophysicists
  • Quantum chemistry (NSF):
  • QM/MM via the Car-Parrinello method
  • Roberto Car, Mike Klein, Glenn Martyna, Mark
    Tuckerman,
  • Nick Nystrom, Josep Torrellas, Laxmikant Kale
  • Materials simulation (NSF):
  • Dendritic growth, quenching, space-time meshes,
    QM/FEM
  • R. Haber, D. Johnson, J. Dantzig,
  • Rocket simulation (DOE):
  • DOE-funded ASCI center
  • Mike Heath, 30 faculty
  • Computational cosmology (NSF, NASA):
  • Simulation
  • Scalable visualization
  • Others:
  • Simulation of plasma
  • Electromagnetics

61
Summary
  • We are pursuing a broad agenda
  • aimed at productivity and performance in parallel
    programming
  • Intelligent runtime system for adaptive
    strategies
  • Charm++/AMPI are production-level systems
  • They support dynamic load balancing and
    communication optimizations
  • Performance prediction capabilities based on
    simulation
  • Basic fault tolerance and performance visualization
    tools are part of the suite
  • Application-oriented yet computer-science-centered
    research

Workshop on Charm++ and Applications, Oct 18-20,
UIUC. http://charm.cs.uiuc.edu