Title: Memory Architectures for Protein Folding: MD on million PIM processors
1 Memory Architectures for Protein Folding: MD on million PIM processors
2 Overview
- EIA-0081307 ITR: Intelligent Memory Architectures and Algorithms to Crack the Protein Folding Problem
- PIs
  - Josep Torrellas and Laxmikant Kale (University of Illinois)
  - Mark Tuckerman (New York University)
  - Michael Klein (University of Pennsylvania)
  - Also associated: Glenn Martyna (IBM)
- Period: 8/00 - 7/03
3 Project Description
- Multidisciplinary project in computer architecture and software, and computational biology
- Goals
  - Design improved algorithms to help solve the protein folding problem
  - Design the architecture and software of general-purpose parallel machines that speed up the solution of the problem
4 Some Recent Progress: Ideas
- Developed REPSWA (Reference Potential Spatial Warping Algorithm)
  - Novel algorithm for accelerating conformational sampling in molecular dynamics, a key element in protein folding
  - Based on a "spatial warping" variable transformation
  - This transformation is designed to shrink barrier regions on the energy landscape and grow attractive basins without altering the equilibrium properties of the system
  - Result: large gains in sampling efficiency
- "Using novel variable transformations to enhance conformational sampling in molecular dynamics," Z. Zhu, M. E. Tuckerman, S. O. Samuelson and G. J. Martyna, Phys. Rev. Lett. 88, 100201 (2002).
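Why a variable transformation can reshape the landscape without changing equilibrium properties can be seen from a one-dimensional change of variables. This is a minimal sketch, not the formulation from the cited paper: the symbols $V_{\mathrm{ref}}$, $\tilde V$ and the specific choice of $du/dx$ are assumptions made here for illustration.

```latex
% One coordinate x, potential V(x), inverse temperature \beta = 1/(k_B T).
% Let u = f(x) be any monotonic transformation.
\begin{align}
Z &= \int dx \, e^{-\beta V(x)}
   = \int du \, \left|\frac{dx}{du}\right| e^{-\beta V(x(u))}
   = \int du \, e^{-\beta \tilde V(u)}, \\
\tilde V(u) &= V(x(u)) - \beta^{-1} \ln\left|\frac{dx}{du}\right| .
\end{align}
% Illustrative choice (assumption): du/dx = e^{-\beta V_{\mathrm{ref}}(x)},
% with V_ref a reference potential that matches V in the barrier region.
\begin{equation}
\tilde V(u) = V(x(u)) - V_{\mathrm{ref}}(x(u)) .
\end{equation}
```

The partition function is unchanged by construction, so equilibrium averages are preserved, while the dynamics in $u$ see the potential $\tilde V$, in which a barrier well matched by $V_{\mathrm{ref}}$ is flattened.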
5 Some Recent Progress: Tools
- Developed LeanMD, a parallel molecular dynamics program that targets very large-scale parallel machines
  - Research-quality program based on the Charm++ parallel object-oriented language
  - Descendant of NAMD (another parallel molecular dynamics application), which achieved unprecedented speedup on thousands of processors
- LeanMD is intended to run on next-generation parallel machines with tens of thousands or even millions of processors, such as Blue Gene/L or Blue Gene/C
  - Requires a new parallelization strategy that breaks up the simulation problem in a more fine-grained manner, generating enough parallelism to distribute work effectively across a million processors
6 Some Recent Progress: Tools
- Developed a high-performance communication library
  - For collective communication operations
    - AlltoAll personalized communication, AlltoAll multicast, and AllReduce
  - These operations can be complex and time consuming on large parallel machines
  - Especially costly for applications that involve all-to-all patterns
    - such as 3-D FFT and sorting
- Library optimizes collective communication operations
  - by combining messages over an imposed virtual topology
- The overhead of AlltoAll communication for 76-byte message exchanges between 2058 processors is in the low tens of milliseconds
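To make the message-combining idea concrete, here is a minimal sketch that routes an all-to-all personalized exchange over a virtual 2-D mesh and counts messages. The slides do not say which virtual topology the library imposes; the 2-D mesh, the two-phase routing, and the function names `direct_all_to_all` and `mesh_all_to_all` are assumptions chosen for illustration.

```python
def direct_all_to_all(p):
    """Message count when every processor sends directly to every other one."""
    return p * (p - 1)


def mesh_all_to_all(rows, cols):
    """Route an all-to-all personalized exchange over a virtual rows x cols mesh.

    Phase 1: each processor bundles its items by destination column and sends
    one combined message to the processor in its own row and that column.
    Phase 2: each processor bundles what it now holds by final destination and
    sends one combined message along its column.

    Returns (messages_sent, delivered), where delivered[dst] is the set of
    source ranks whose item reached dst.
    """
    p = rows * cols

    def coords(rank):
        return divmod(rank, cols)  # (row, col)

    def rank_of(row, col):
        return row * cols + col

    messages = 0

    # Phase 1: row-wise exchange of combined messages.
    after_phase1 = {r: [] for r in range(p)}
    for src in range(p):
        row, _ = coords(src)
        bundles = {}
        for dst in range(p):
            if dst == src:
                continue
            _, dst_col = coords(dst)
            bundles.setdefault(rank_of(row, dst_col), []).append((src, dst))
        for hop, bundle in bundles.items():
            if hop != src:  # items kept locally cost no message
                messages += 1
            after_phase1[hop].extend(bundle)

    # Phase 2: column-wise exchange of combined messages.
    delivered = {r: set() for r in range(p)}
    for proc, items in after_phase1.items():
        bundles = {}
        for src, dst in items:
            bundles.setdefault(dst, []).append(src)
        for dst, sources in bundles.items():
            if dst != proc:
                messages += 1
            delivered[dst].update(sources)

    return messages, delivered


if __name__ == "__main__":
    msgs, _ = mesh_all_to_all(4, 4)
    print("direct:", direct_all_to_all(16), "mesh:", msgs)
```

Under these assumptions each processor sends (cols - 1) + (rows - 1) combined messages instead of p - 1 small ones, trading fewer per-message overheads for an extra hop and larger payloads.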
7 Some Recent Progress: People
- The following graduate student researchers have been supported
  - Sameer Kumar (University of Illinois)
  - Gengbin Zheng (University of Illinois)
  - Jun Nakano (University of Illinois)
  - Zhongwei Zhu (New York University)
8 Overview
- Rest of the talk
- Objective: Develop a molecular dynamics program that will run effectively on a million processors
  - Each with a low memory-to-processor ratio
- Method
  - Use parallel objects methodology
  - Develop an emulator/simulator that allows one to run full-fledged programs on a simulated architecture
- Presenting today
  - Simulator details
  - LeanMD simulation on BG/L and BG/C
9 Performance Prediction on Large Machines
- Problem
  - How to predict performance of applications on future machines?
  - How to do performance tuning without continuous access to a large machine?
- Solution
  - Leverage virtualization
  - Develop a machine emulator
  - Simulator: accurate time modeling
  - Run a program on 100,000 processors using only hundreds of processors
10 Blue Gene Emulator: functional view
[Figure: affinity message queues, Converse scheduler, Converse queue]
11 Emulator to Simulator
- Emulator
  - Study programming model and application development
- Simulator
  - Performance prediction capability
  - Models communication latency based on a network model
  - Doesn't model on-chip memory access or network contention
- Parallel performance is hard to model
  - Communication subsystem
    - Out-of-order messages
    - Communication/computation overlap
  - Event dependencies
- Parallel Discrete Event Simulation
  - Emulation program executes in parallel with event time stamp correction
  - Exploit inherent determinacy of the application
12 How to simulate?
- Time stamping events
  - Per-thread timer (sharing one physical timer)
  - Time stamp messages
    - Calculate communication latency based on network model
- Parallel event simulation
  - When a message is sent out, calculate the predicted arrival time for the destination bluegene-processor
  - When a message is received, update current time as
    - currTime = max(currTime, recvTime)
  - Time stamp correction
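The time-stamping rules above can be sketched as follows. This is an illustration under stated assumptions, not the simulator's code: the class names `NetworkModel` and `SimProcessor`, the linear latency formula, and all parameter values are hypothetical. Only the update rule currTime = max(currTime, recvTime) comes from the slide.

```python
from dataclasses import dataclass


@dataclass
class NetworkModel:
    """Hypothetical linear latency model; all parameter values are illustrative."""
    startup_us: float = 5.0    # fixed per-message cost
    per_byte_us: float = 0.01  # cost per payload byte
    per_hop_us: float = 0.1    # cost per network hop

    def latency(self, nbytes, hops):
        return self.startup_us + self.per_byte_us * nbytes + self.per_hop_us * hops


class SimProcessor:
    """One simulated processor with its own virtual clock."""

    def __init__(self, pid):
        self.pid = pid
        self.curr_time = 0.0  # virtual time in microseconds

    def compute(self, duration_us):
        """Advance the virtual clock by the measured or modelled compute time."""
        self.curr_time += duration_us

    def send(self, net, nbytes, hops):
        """Stamp an outgoing message with its predicted arrival time (recvTime)."""
        return self.curr_time + net.latency(nbytes, hops)

    def receive(self, recv_time):
        """Apply the slide's rule: currTime = max(currTime, recvTime)."""
        self.curr_time = max(self.curr_time, recv_time)


if __name__ == "__main__":
    net = NetworkModel()
    a, b = SimProcessor(0), SimProcessor(1)
    a.compute(10.0)
    recv_time = a.send(net, nbytes=100, hops=10)
    b.receive(recv_time)  # b was idle, so its clock jumps to the arrival time
    b.compute(3.0)
    b.receive(12.0)       # message predicted to arrive earlier: clock does not move back
    print(b.curr_time)
```

The max rule means an idle processor waits for the message, while a busy one handles it as soon as it is free; the second case is where out-of-order arrival can later require time stamp correction.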
13 Parallel correction algorithm
- Sort message execution by receive time
- Adjust time stamps when needed
- Use a correction message to report the change in an event's startTime
- Send out correction messages along the path the original message was sent
- Events already in the timeline may have to move
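A single processor's side of the correction steps can be sketched as a timeline that is kept sorted by receive time and recomputed when a late message or a correction arrives. The `Event` and `Timeline` names and the dictionary of moved events are hypothetical; in the parallel algorithm each moved event would in turn trigger correction messages for the messages it sent, which this sketch leaves out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    name: str
    recv_time: float                    # predicted arrival time of the triggering message
    duration: float                     # execution time of the event
    start_time: Optional[float] = None  # filled in by the timeline


class Timeline:
    """Events executed on one simulated processor, ordered by receive time."""

    def __init__(self):
        self.events = []

    def insert(self, event):
        """Add an event (possibly arriving late) and return the events that moved."""
        self.events.append(event)
        return self._recompute()

    def correct(self, name, new_recv_time):
        """Apply a correction message that changes an event's receive time."""
        for event in self.events:
            if event.name == name:
                event.recv_time = new_recv_time
                break
        else:
            raise KeyError(name)
        return self._recompute()

    def _recompute(self):
        # Stable sort by receive time, then replay the timeline.
        self.events.sort(key=lambda e: e.recv_time)
        moved = {}
        clock = 0.0
        for event in self.events:
            start = max(clock, event.recv_time)
            if start != event.start_time:
                moved[event.name] = start  # would trigger downstream corrections
                event.start_time = start
            clock = start + event.duration
        return moved


if __name__ == "__main__":
    tl = Timeline()
    tl.insert(Event("A", recv_time=0.0, duration=5.0))
    tl.insert(Event("B", recv_time=10.0, duration=4.0))
    # A message with an earlier time stamp shows up late and pushes B back.
    print(tl.insert(Event("C", recv_time=2.0, duration=6.0)))
    # A correction moves C later, so B returns to its original start.
    print(tl.correct("C", 20.0))
```

Replaying the whole timeline on every change keeps the sketch short; an implementation concerned with cost would only revisit events at or after the insertion point.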
14-17 Timestamps Correction
[Figures: four slides illustrating time stamp correction]
18 Predicted time vs latency factor
Validation
19 LeanMD
- LeanMD is a molecular dynamics simulation application written in Charm++
- Next generation of NAMD
  - The Gordon Bell Award winner at SC2002
- Requires a new parallelization strategy
  - Break up the problem in a more fine-grained manner to distribute work effectively across an extremely large number of processors
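To give a feel for how a finer-grained decomposition multiplies the available parallelism, the sketch below counts work units when each cell of a periodic 3-D grid, and each pair of neighbouring cells, becomes its own unit of work. The slides do not describe LeanMD's actual decomposition; the cell grid, the one-cell neighbour range, and the name `count_work_units` are assumptions for illustration only.

```python
from itertools import product


def count_work_units(nx, ny, nz):
    """Count cells and cell-pair work units on a periodic nx x ny x nz grid.

    Assumption: one work unit per cell (self interactions) plus one per
    unordered pair of cells that touch, including diagonal neighbours.
    """
    dims = (nx, ny, nz)
    cells = list(product(range(nx), range(ny), range(nz)))
    units = set()
    for cell in cells:
        for offset in product((-1, 0, 1), repeat=3):
            neighbour = tuple((cell[i] + offset[i]) % dims[i] for i in range(3))
            # Sorting makes (a, b) and (b, a) the same unit; (a, a) is the self unit.
            units.add(tuple(sorted((cell, neighbour))))
    return len(cells), len(units)


if __name__ == "__main__":
    n_cells, n_units = count_work_units(4, 4, 4)
    print(n_cells, "cells ->", n_units, "work units")
```

With at least three cells per dimension, every cell has 26 distinct neighbours, so this scheme yields 14 work units per cell: roughly an order of magnitude more objects to spread across processors than a cell-only decomposition.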
20 LeanMD Performance Analysis
- Note: need readable graphs; one per page is fine, but with larger fonts and thicker lines