1
Optimizing Quantum Chemistry using Charm++
Eric Bohm
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
2
Overview
  • Decomposition
    • State Planes
    • 3-D FFT
    • 3-D Matrix Multiply
  • Utilizing Charm++
    • Prioritized nonlocal computation
    • Commlib
    • Projections
  • CPMD
    • 9 phases
  • Charm++ applicability
    • Overlap
    • Decomposition
    • Portability
    • Communication Optimization

3
Quantum Chemistry
  • LeanCP Collaboration
    • Glenn Martyna (IBM TJ Watson)
    • Mark Tuckerman (NYU)
    • Nick Nystrom (PSC)
    • PPL: Kale, Shi, Bohm, Pauli, Kumar (now at IBM), Vadali
  • CPMD Method
    • Plane wave QM
    • Charm++ parallelization
    • PINY MD physics engine

4
CPMD on Charm++
  • 11 Charm++ arrays
  • 4 Charm++ modules
  • 13 Charm++ groups
  • 3 Commlib strategies
  • BLAS
  • FFTW
  • PINY MD
  • Adaptive overlap
  • Prioritized computation for a phased application
  • Communication optimization
  • Load balancing
  • Group caches
  • Rth threads

5
Practical Scaling
  • Single-Wall Carbon Nanotube Field-Effect Transistor
  • BG/L Performance

6
Computation Flow
7
Charm++
  • Uses the approach of virtualization
    • Divide the work into virtual processors (VPs)
    • Typically many more VPs than physical processors
    • Schedule each VP for execution
  • Advantages (see the sketch below)
    • Computation and communication can be overlapped (between VPs)
    • The number of VPs can be independent of the processor count
    • Others: load balancing, checkpointing, etc.

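To make this concrete, here is a minimal sketch of a chare array of VPs; the class and entry names are illustrative, not from LeanCP, and the matching .ci interface file that charmc compiles is shown only in the comment:

```cpp
// A 1-D chare array of virtual processors: the runtime maps many
// elements onto each physical processor and schedules whichever one
// has a message ready, overlapping one VP's communication with
// another's computation. Assumes a .ci file declaring:
//   array [1D] Worker { entry Worker(); entry void compute(int n, double data[n]); };
// charmc generates worker.decl.h / worker.def.h from it.
#include "worker.decl.h"

class Worker : public CBase_Worker {
 public:
  Worker() {}
  Worker(CkMigrateMessage*) {}   // required for migration / load balancing

  // Message-driven: invoked when this VP's input arrives.
  void compute(int n, double* data) {
    // ... local work on this VP's share of the problem ...
  }
};

#include "worker.def.h"
```
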
8
Decomposition
  • A higher degree of virtualization works better for Charm++
  • Real Space State Planes, Gspace State Planes, Rho Real and Rho G,
    S-Calculators for each Gspace state plane
  • Tens of thousands of chares for a 32-molecule problem
  • Careful scheduling to maximize efficiency
  • Most of the computation is in FFTs and matrix multiplies

9
3-D FFT Implementation
(Diagrams: dense 3-D FFT and sparse 3-D FFT)
10
Parallel FFT Library
  • Slab-based parallelization
  • We do not re-implement the sequential routine
    • Utilize the 1-D and 2-D FFT routines provided by FFTW
  • Allow for
    • Multiple 3-D FFTs simultaneously
    • Multiple data sets within the same set of slab objects
  • Useful, as 3-D FFTs are frequently used in CP computations (see the
    sketch below)

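The per-slab use of FFTW can be sketched as follows; the grid size is illustrative and this is not the LeanCP library itself. Each slab object does a local 2-D FFT, and the 1-D FFTs along the third axis run after a transpose, whose message-driven communication is omitted here:

```cpp
#include <fftw3.h>

int main() {
    const int N = 64;                        // assumed grid size

    // One slab: an N x N plane of complex grid values.
    fftw_complex* slab = fftw_alloc_complex(N * N);
    fftw_plan plane2d = fftw_plan_dft_2d(N, N, slab, slab,
                                         FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(plane2d);                   // 2-D FFT over the local plane

    // After the transpose, each object owns N pencils of length N and
    // finishes the 3-D FFT with batched 1-D transforms.
    fftw_complex* pencils = fftw_alloc_complex(N * N);
    fftw_plan line1d = fftw_plan_many_dft(1, &N, N,
                                          pencils, nullptr, 1, N,
                                          pencils, nullptr, 1, N,
                                          FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(line1d);

    fftw_destroy_plan(plane2d);
    fftw_destroy_plan(line1d);
    fftw_free(slab);
    fftw_free(pencils);
    return 0;
}
```
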
11
Multiple Parallel 3-D FFTs
12
Matrix Multiply
  • Also known as the S-Calculator or PairCalculator
  • Decompose state-plane values into smaller objects
  • Use DGEMM on the smaller sub-matrices (sketched below)
  • Sum the partial products via a reduction back to Gspace

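A minimal sketch of the local DGEMM step on one sub-block, using the standard CBLAS interface; the block size is illustrative, and the Charm++ reduction that sums the partial products is omitted:

```cpp
#include <cblas.h>
#include <vector>

int main() {
    const int nb = 128;   // assumed sub-matrix block size
    std::vector<double> A(nb * nb, 1.0), B(nb * nb, 1.0), C(nb * nb, 0.0);

    // C = 1.0 * A * B + 0.0 * C on this chare's sub-block; the partial
    // products from all chares are then summed by a reduction.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                nb, nb, nb,
                1.0, A.data(), nb,
                B.data(), nb,
                0.0, C.data(), nb);
    return 0;
}
```
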
13
Matrix Multiply: VP-Based Approach
14
Charm++ Tricks and Tips
  • Message-driven execution and a high degree of virtualization present
    tuning challenges
  • Flow of control using Rth threads
  • Prioritized messages
  • Commlib framework
  • Charm++ arrays vs. groups
  • Problem identification with Projections
  • Problem isolation techniques

15
Flow Control in Parallel
  • Rth Threads
    • Based on Duff's device, these are user-level threads with
      negligible overhead (see the sketch below)
    • Essentially goto and return without loss of readability
  • Allow for an event-loop style of programming
  • Make the flow of control explicit
  • Use familiar threading semantics

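The Duff's-device trick behind such user-level threads can be shown in a few lines; the macro names here are illustrative, not the actual Rth API. A switch on a saved line number lets a function resume exactly where it left off, giving suspend/resume without per-thread stacks:

```cpp
#include <cstdio>

#define THREAD_BEGIN(state)   switch (state) { case 0:
#define THREAD_SUSPEND(state) do { state = __LINE__; return; case __LINE__:; } while (0)
#define THREAD_END            }

struct Flow { int state = 0; };   // saved resume point

// Each call runs until the next suspend point, then returns to the
// message-driven scheduler; the next call resumes just past that point.
void driver(Flow& f) {
    THREAD_BEGIN(f.state);
    std::puts("phase 1: start FFT");
    THREAD_SUSPEND(f.state);      // event loop runs while we wait
    std::puts("phase 2: FFT data arrived, continue");
    THREAD_END;
}

int main() {
    Flow f;
    driver(f);   // prints phase 1, suspends
    driver(f);   // resumes, prints phase 2
    return 0;
}
```
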
16
Rth Threads for Flow Control
17
Prioritized Messages for Overlap
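CkEntryOptions is the Charm++ mechanism for this; the proxy and entry-method names below are hypothetical, not the actual LeanCP ones. Smaller integer priorities are served first, so urgent nonlocal work can jump ahead of bulk transpose traffic in each processor's message queue:

```cpp
#include "charm++.h"

// A minimal sketch of a prioritized send (CProxy_GSpacePlane and
// computeNonlocal are assumed names for illustration).
void sendNonlocal(CProxy_GSpacePlane gspace, int plane,
                  int n, double* coeffs) {
    CkEntryOptions opts;
    opts.setPriority(-100);                  // most urgent phase first
    gspace[plane].computeNonlocal(n, coeffs, &opts);
}
```
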
18
Communication Library
  • Fine-grained decomposition can result in many small messages
  • Message combining via the Commlib framework in Charm++ addresses this
    problem (a toy illustration follows)
  • The streaming protocol optimizes many-to-many personalized
    communication
  • Forwarding protocols like Ring or Multiring can be beneficial
    • But not on BG/L

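A toy illustration of the idea behind message combining, not the Commlib API itself: buffer small records destined for the same processor and flush them as one message once a threshold is reached.

```cpp
#include <vector>

struct Batcher {
    std::vector<char> buf;
    size_t threshold;
    explicit Batcher(size_t t) : threshold(t) {}

    // Queue one small logical message; send when the buffer fills.
    void send(const void* msg, size_t len) {
        const char* p = static_cast<const char*>(msg);
        buf.insert(buf.end(), p, p + len);
        if (buf.size() >= threshold) flush();
    }

    // One network send carries many logical messages (stubbed out here).
    void flush() {
        buf.clear();
    }
};
```
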
19
Commlib Strategy Selection
20
Bound Arrays
  • Why?
    • Efficiency and clarity of expression
  • Two arrays of the same dimensionality whose like indices are
    co-placed (sketched below)
  • Gspace and the nonlocal computation both have plane-based
    computations and share many data elements
  • Use ckLocal() to access elements directly, with ordinary local
    function calls
  • The arrays remain distinct parallel objects

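A minimal sketch of bound arrays; the chare names and variables are illustrative. Binding guarantees that like-indexed elements live on the same processor, so ckLocal() on the partner proxy always returns a valid pointer:

```cpp
// At creation time (numPlanes is an assumed problem parameter):
CkArrayOptions opts(numPlanes);
CProxy_GSpacePlane gspace = CProxy_GSpacePlane::ckNew(opts);
opts.bindTo(gspace);                          // co-place like indices
CProxy_NonLocal nonlocal = CProxy_NonLocal::ckNew(opts);

// Later, inside GSpacePlane element thisIndex:
NonLocal* partner = nonlocal[thisIndex].ckLocal();
partner->shareData(myPlaneData);              // plain local call, no message
```
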
21
Group Caching Techniques
  • Group objects have one element per processor,
    making them excellent cache points for arrays that may have many
    chares per processor (see the sketch below)
  • Place low-volatility data in the group
  • Array elements use ckLocalBranch() for access
  • In CPMD, all chares that have plane P share the same copy of the
    Structure Factor in memory

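A minimal sketch, with illustrative names and assuming a matching .ci declaration of the group: one branch exists per processor, so every chare there shares a single cached copy through ckLocalBranch().

```cpp
#include <map>
#include <vector>

class SFCache : public CBase_SFCache {
 public:
  // Low-volatility data, keyed by plane index; filled once, read often.
  std::map<int, std::vector<double>> structureFactor;
};

// From any array element on this processor:
SFCache* cache = sfCacheProxy.ckLocalBranch();
std::vector<double>& sf = cache->structureFactor[planeIndex];
```
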
22
Charm++ Performance Debugging
  • Complex parallel applications are hard to debug
  • An event-based model with a high degree of virtualization presents
    new challenges
  • Tools: Projections and the Charm++ debugger
  • Bottleneck identification
    • using the Projections Usage Profile tool

23
Old S->T Orthonormalization
24
After Parallel S->T
25
Problem Isolation Techniques
  • Using Rth threads, it is easy to isolate phases by adding a barrier
    • Contribute to reduction -> suspend
    • Reduction's broadcast client -> resume
  • In the following example we break up the Gspace IFFT into separate
    computation and communication entry methods (see the sketch below)
  • We then insert a barrier between them to highlight a specific
    performance problem

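A minimal sketch of the barrier trick, with illustrative method names and assuming the resume method is declared a [reductiontarget] entry in the .ci file: every element contributes to an empty reduction and suspends, and the reduction's broadcast callback resumes all elements, cleanly splitting the two phases in the timeline.

```cpp
void GSpacePlane::finishIFFTCompute() {
    // Acts as a barrier: the reduction completes only after every
    // element has contributed.
    contribute(CkCallback(CkReductionTarget(GSpacePlane, startIFFTComm),
                          thisProxy));
    // The Rth thread suspends here until startIFFTComm is delivered.
}

void GSpacePlane::startIFFTComm() {
    // Broadcast client: all elements resume and begin the IFFT sends.
}
```
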
26
Projections Timeline Analysis
27
Future Work
  • Scaling to 20K processors on BG/L: density pencil FFTs
  • Rhospace real->complex doublepack optimization
  • New FFT-based algorithm for the Structure Factor
  • More systems
  • Topology-aware chare mapping
  • Orchestration HLL expression

28
What time is it in Scotland?
  • There is a 1024-node BG/L in Edinburgh
  • Local time there is 6 hours ahead of Central Time
  • During this non-production time, we can run on the full rack at night
  • Thank you, EPCC!