1
Optimizing Quantum Chemistry using Charm++
Eric Bohm
http://charm.cs.uiuc.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign
2
Overview
  • Decomposition
    • State Planes
    • 3-D FFT
    • 3-D Matrix Multiply
  • Utilizing Charm++
    • Prioritized nonlocal computation
    • Commlib
    • Projections
  • CPMD
    • 9 phases
  • Charm++ applicability
    • Overlap
    • Decomposition
    • Portability
    • Communication Optimization

3
Quantum Chemistry
  • LeanCP Collaboration
    • Glenn Martyna (IBM TJ Watson)
    • Mark Tuckerman (NYU)
    • Nick Nystrom (PSC)
    • PPL: Kale, Shi, Bohm, Pauli, Kumar (now at IBM), Vadali
  • CPMD Method
    • Plane wave QM
    • Charm++ parallelization
    • PINY MD physics engine

4
CPMD on Charm++
  • 11 Charm++ arrays
  • 4 Charm++ modules
  • 13 Charm++ groups
  • 3 Commlib strategies
  • BLAS
  • FFTW
  • PINY MD
  • Adaptive overlap
  • Prioritized computation for a phased application
  • Communication optimization
  • Load balancing
  • Group caches
  • Rth threads

5
Practical Scaling
  • Single-Wall Carbon Nanotube Field-Effect Transistor
  • BG/L Performance

6
Computation Flow
7
Charm++
  • Uses the approach of virtualization
    • Divide the work into virtual processors (VPs)
    • Typically many more VPs than physical processors
    • Schedule each VP for execution
  • Advantages (see the sketch below)
    • Computation and communication can be overlapped (between VPs)
    • The number of VPs can be independent of the processor count
    • Others: load balancing, checkpointing, etc.

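To make this concrete, here is a minimal sketch of a chare array of VPs; the class and entry names are illustrative, not from LeanCP, and the matching .ci interface file that charmc compiles is shown only in the comment:

```cpp
// A 1-D chare array of virtual processors: the runtime maps many
// elements onto each physical processor and schedules whichever one
// has a message ready, overlapping one VP's communication with
// another's computation. Assumes a .ci file declaring:
//   array [1D] Worker { entry Worker(); entry void compute(int n, double data[n]); };
// charmc generates worker.decl.h / worker.def.h from it.
#include "worker.decl.h"

class Worker : public CBase_Worker {
 public:
  Worker() {}
  Worker(CkMigrateMessage*) {}   // required for migration / load balancing

  // Message-driven: invoked when this VP's input arrives.
  void compute(int n, double* data) {
    // ... local work on this VP's share of the problem ...
  }
};

#include "worker.def.h"
```
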
8
Decomposition
  • A higher degree of virtualization works better for Charm++
  • Real Space State Planes, Gspace State Planes, Rho Real and Rho G,
    S-Calculators for each Gspace state plane
  • Tens of thousands of chares for a 32-molecule problem
  • Careful scheduling to maximize efficiency
  • Most of the computation is in FFTs and matrix multiplies

9
3-D FFT Implementation
(Diagrams: dense 3-D FFT and sparse 3-D FFT)
10
Parallel FFT Library
  • Slab-based parallelization
  • We do not re-implement the sequential routine
    • Utilize the 1-D and 2-D FFT routines provided by FFTW
  • Allow for
    • Multiple 3-D FFTs simultaneously
    • Multiple data sets within the same set of slab objects
  • Useful, as 3-D FFTs are frequently used in CP computations (see the
    sketch below)

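The per-slab use of FFTW can be sketched as follows; the grid size is illustrative and this is not the LeanCP library itself. Each slab object does a local 2-D FFT, and the 1-D FFTs along the third axis run after a transpose, whose message-driven communication is omitted here:

```cpp
#include <fftw3.h>

int main() {
    const int N = 64;                        // assumed grid size

    // One slab: an N x N plane of complex grid values.
    fftw_complex* slab = fftw_alloc_complex(N * N);
    fftw_plan plane2d = fftw_plan_dft_2d(N, N, slab, slab,
                                         FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(plane2d);                   // 2-D FFT over the local plane

    // After the transpose, each object owns N pencils of length N and
    // finishes the 3-D FFT with batched 1-D transforms.
    fftw_complex* pencils = fftw_alloc_complex(N * N);
    fftw_plan line1d = fftw_plan_many_dft(1, &N, N,
                                          pencils, nullptr, 1, N,
                                          pencils, nullptr, 1, N,
                                          FFTW_FORWARD, FFTW_ESTIMATE);
    fftw_execute(line1d);

    fftw_destroy_plan(plane2d);
    fftw_destroy_plan(line1d);
    fftw_free(slab);
    fftw_free(pencils);
    return 0;
}
```
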
11
Multiple Parallel 3-D FFTs
12
Matrix Multiply
  • Also known as the S-Calculator or PairCalculator
  • Decompose state-plane values into smaller objects
  • Use DGEMM on the smaller sub-matrices (sketched below)
  • Sum the partial products via a reduction back to Gspace

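A minimal sketch of the local DGEMM step on one sub-block, using the standard CBLAS interface; the block size is illustrative, and the Charm++ reduction that sums the partial products is omitted:

```cpp
#include <cblas.h>
#include <vector>

int main() {
    const int nb = 128;   // assumed sub-matrix block size
    std::vector<double> A(nb * nb, 1.0), B(nb * nb, 1.0), C(nb * nb, 0.0);

    // C = 1.0 * A * B + 0.0 * C on this chare's sub-block; the partial
    // products from all chares are then summed by a reduction.
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                nb, nb, nb,
                1.0, A.data(), nb,
                B.data(), nb,
                0.0, C.data(), nb);
    return 0;
}
```
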
13
Matrix Multiply: VP-Based Approach
14
Charm++ Tricks and Tips
  • Message-driven execution and a high degree of virtualization present
    tuning challenges
  • Flow of control using Rth threads
  • Prioritized messages
  • Commlib framework
  • Charm++ arrays vs. groups
  • Problem identification with Projections
  • Problem isolation techniques

15
Flow Control in Parallel
  • Rth Threads
    • Based on Duff's device, these are user-level threads with
      negligible overhead (see the sketch below)
    • Essentially goto and return without loss of readability
  • Allow for an event-loop style of programming
  • Make the flow of control explicit
  • Use familiar threading semantics

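The Duff's-device trick behind such user-level threads can be shown in a few lines; the macro names here are illustrative, not the actual Rth API. A switch on a saved line number lets a function resume exactly where it left off, giving suspend/resume without per-thread stacks:

```cpp
#include <cstdio>

#define THREAD_BEGIN(state)   switch (state) { case 0:
#define THREAD_SUSPEND(state) do { state = __LINE__; return; case __LINE__:; } while (0)
#define THREAD_END            }

struct Flow { int state = 0; };   // saved resume point

// Each call runs until the next suspend point, then returns to the
// message-driven scheduler; the next call resumes just past that point.
void driver(Flow& f) {
    THREAD_BEGIN(f.state);
    std::puts("phase 1: start FFT");
    THREAD_SUSPEND(f.state);      // event loop runs while we wait
    std::puts("phase 2: FFT data arrived, continue");
    THREAD_END;
}

int main() {
    Flow f;
    driver(f);   // prints phase 1, suspends
    driver(f);   // resumes, prints phase 2
    return 0;
}
```
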
16
Rth Threads for Flow Control
17
Prioritized Messages for Overlap
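CkEntryOptions is the Charm++ mechanism for this; the proxy and entry-method names below are hypothetical, not the actual LeanCP ones. Smaller integer priorities are served first, so urgent nonlocal work can jump ahead of bulk transpose traffic in each processor's message queue:

```cpp
#include "charm++.h"

// A minimal sketch of a prioritized send (CProxy_GSpacePlane and
// computeNonlocal are assumed names for illustration).
void sendNonlocal(CProxy_GSpacePlane gspace, int plane,
                  int n, double* coeffs) {
    CkEntryOptions opts;
    opts.setPriority(-100);                  // most urgent phase first
    gspace[plane].computeNonlocal(n, coeffs, &opts);
}
```
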
18
Communication Library
  • Fine-grained decomposition can result in many small messages
  • Message combining via the Commlib framework in Charm++ addresses this
    problem (a toy illustration follows)
  • The streaming protocol optimizes many-to-many personalized
    communication
  • Forwarding protocols like Ring or Multiring can be beneficial
    • But not on BG/L

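A toy illustration of the idea behind message combining, not the Commlib API itself: buffer small records destined for the same processor and flush them as one message once a threshold is reached.

```cpp
#include <vector>

struct Batcher {
    std::vector<char> buf;
    size_t threshold;
    explicit Batcher(size_t t) : threshold(t) {}

    // Queue one small logical message; send when the buffer fills.
    void send(const void* msg, size_t len) {
        const char* p = static_cast<const char*>(msg);
        buf.insert(buf.end(), p, p + len);
        if (buf.size() >= threshold) flush();
    }

    // One network send carries many logical messages (stubbed out here).
    void flush() {
        buf.clear();
    }
};
```
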
19
Commlib Strategy Selection
20
Bound Arrays
  • Why?
    • Efficiency and clarity of expression
  • Two arrays of the same dimensionality whose like indices are
    co-placed (sketched below)
  • Gspace and the nonlocal computation both have plane-based
    computations and share many data elements
  • Use ckLocal() to access elements directly, with ordinary local
    function calls
  • The arrays remain distinct parallel objects

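A minimal sketch of bound arrays; the chare names and variables are illustrative. Binding guarantees that like-indexed elements live on the same processor, so ckLocal() on the partner proxy always returns a valid pointer:

```cpp
// At creation time (numPlanes is an assumed problem parameter):
CkArrayOptions opts(numPlanes);
CProxy_GSpacePlane gspace = CProxy_GSpacePlane::ckNew(opts);
opts.bindTo(gspace);                          // co-place like indices
CProxy_NonLocal nonlocal = CProxy_NonLocal::ckNew(opts);

// Later, inside GSpacePlane element thisIndex:
NonLocal* partner = nonlocal[thisIndex].ckLocal();
partner->shareData(myPlaneData);              // plain local call, no message
```
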
21
Group Caching Techniques
  • Group objects have one element per processor,
    making them excellent cache points for arrays that may have many
    chares per processor (see the sketch below)
  • Place low-volatility data in the group
  • Array elements use ckLocalBranch() for access
  • In CPMD, all chares that have plane P share the same copy of the
    Structure Factor in memory

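A minimal sketch, with illustrative names and assuming a matching .ci declaration of the group: one branch exists per processor, so every chare there shares a single cached copy through ckLocalBranch().

```cpp
#include <map>
#include <vector>

class SFCache : public CBase_SFCache {
 public:
  // Low-volatility data, keyed by plane index; filled once, read often.
  std::map<int, std::vector<double>> structureFactor;
};

// From any array element on this processor:
SFCache* cache = sfCacheProxy.ckLocalBranch();
std::vector<double>& sf = cache->structureFactor[planeIndex];
```
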
22
Charm++ Performance Debugging
  • Complex parallel applications are hard to debug
  • An event-based model with a high degree of virtualization presents
    new challenges
  • Tools: Projections and the Charm++ debugger
  • Bottleneck identification
    • using the Projections Usage Profile tool

23
Old S->T Orthonormalization
24
After Parallel S->T
25
Problem Isolation Techniques
  • Using Rth threads, it is easy to isolate phases by adding a barrier
    • Contribute to reduction -> suspend
    • Reduction's broadcast client -> resume
  • In the following example we break up the Gspace IFFT into separate
    computation and communication entry methods (see the sketch below)
  • We then insert a barrier between them to highlight a specific
    performance problem

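A minimal sketch of the barrier trick, with illustrative method names and assuming the resume method is declared a [reductiontarget] entry in the .ci file: every element contributes to an empty reduction and suspends, and the reduction's broadcast callback resumes all elements, cleanly splitting the two phases in the timeline.

```cpp
void GSpacePlane::finishIFFTCompute() {
    // Acts as a barrier: the reduction completes only after every
    // element has contributed.
    contribute(CkCallback(CkReductionTarget(GSpacePlane, startIFFTComm),
                          thisProxy));
    // The Rth thread suspends here until startIFFTComm is delivered.
}

void GSpacePlane::startIFFTComm() {
    // Broadcast client: all elements resume and begin the IFFT sends.
}
```
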
26
Projections Timeline Analysis
27
Future Work
  • Scaling to 20K processors on BG/L: density pencil FFTs
  • Rhospace real->complex doublepack optimization
  • New FFT-based algorithm for the Structure Factor
  • More systems
  • Topology-aware chare mapping
  • Orchestration HLL expression

28
What time is it in Scotland?
  • There is a 1024-node BG/L in Edinburgh
  • Local time there is 6 hours ahead of Central Time
  • During this non-production time, we can run on the full rack at night
  • Thank you, EPCC!