Title: Parallelizing the graph isomorphism portion of an automatic reaction mechanism generation algorithm
1. Parallelizing the graph isomorphism portion of an automatic reaction mechanism generation algorithm
- Geoff Oxberry
- 18.337 Project, Spring 2009
2. Automatic reaction mechanism generation yields models quickly
- Reaction mechanisms are used to model chemistry in a wide range of applications
- Generating first-principles reaction mechanisms can take years and requires lots of expertise
- The Bill Green group developed software (RMG) that automatically generates these models based on rules
3. For some problems, RMG takes days to generate a mechanism
- We want it to take a day or less on a cluster
- Big bottleneck for us is that we have to repeatedly solve a colored graph isomorphism (GI) problem
- If we can speed it up, we can solve many more interesting chemistry problems
- Parallelism is one option
4. We want to see if parallelism can be used to speed up RMG
- Want to see if a parallel version of RMG is faster than a serial version
- Due to time constraints, I chose to implement skeletal prototypes of serial and parallel versions of RMG in Python
- Idea is to use results from the prototypes to see if it is worth parallelizing the production-scale code
5. Parallelism does speed up RMG on intermediate-sized case studies
- When searching for graph isomorphisms in collections of 20 or fewer graphs, serial code is faster
- When searching for graph isomorphisms in collections of 100 graphs, parallel code is faster
- When searching for graph isomorphisms in collections of 2000 graphs, serial code is faster again
6. Outline
- Brief overview of graph isomorphism
- Discussion of existing RMG algorithm and how to parallelize
- Python prototypes of serial and parallel versions of RMG algorithm
- Results
- Discussion of obstacles
- Conclusions
7. Two graphs are isomorphic if there exists an edge-preserving bijection between their nodes
- These two graphs are isomorphic
[Figure: two drawings of a graph, each with nodes numbered 1 to 5]
- Bijection here (L-R): 1-1, 3-4, 2-5, 5-2, 4-3
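As a minimal sketch of the definition, the check below tests whether a node mapping carries the edge set of one graph exactly onto the edge set of the other, and then searches all bijections by brute force. The edge lists are hypothetical, since the figure's edges are not recoverable from the text; they were chosen so that the bijection listed on this slide is valid. Brute force is only practical for tiny graphs and is not meant to represent the algorithm used by igraph or RMG.

```python
from itertools import permutations


def normalize(edges):
    """Store undirected edges as frozensets so that (u, v) equals (v, u)."""
    return {frozenset(e) for e in edges}


def preserves_edges(mapping, edges_a, edges_b):
    """True if `mapping` sends the edge set of A exactly onto the edge set of B."""
    image = {frozenset(mapping[u] for u in e) for e in normalize(edges_a)}
    return image == normalize(edges_b)


def find_isomorphism(nodes_a, edges_a, nodes_b, edges_b):
    """Try every bijection between the node sets; return one that works, or None."""
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return None
    for perm in permutations(nodes_b):
        mapping = dict(zip(nodes_a, perm))
        if preserves_edges(mapping, edges_a, edges_b):
            return mapping
    return None


nodes = [1, 2, 3, 4, 5]
# Hypothetical edges: the slide's figure does not survive in the text.
left_edges = [(1, 2), (2, 3), (3, 4), (4, 5), (1, 3)]
right_edges = [(1, 5), (4, 5), (3, 4), (2, 3), (1, 4)]

# Bijection from the slide (left node -> right node).
slide_bijection = {1: 1, 3: 4, 2: 5, 5: 2, 4: 3}
identity = {n: n for n in nodes}

slide_ok = preserves_edges(slide_bijection, left_edges, right_edges)
identity_ok = preserves_edges(identity, left_edges, right_edges)
found = find_isomorphism(nodes, left_edges, nodes, right_edges)
```

With these edge lists the slide's bijection preserves edges while the identity mapping does not, which is the difference between "same drawing" and "same graph up to relabeling".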
8. In RMG, ChemGraphs represent species
- ChemGraphs are graphs with node labels and edge labels
- Species are represented by a class of graphs equivalent under isomorphism
- Example (methane)
[Figure: two drawings of the methane graph, each with nodes numbered 1 to 5. Node labels refer to atom types, edge labels refer to bond types]
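A labeled (colored) isomorphism check must match atom types and bond types as well as connectivity. The sketch below illustrates this for two numberings of methane using a brute-force search; the dictionary layout, the helper names `bond_table` and `labeled_isomorphic`, and the bond label `"S"` for a single bond are all assumptions made for illustration, not RMG's ChemGraph representation.

```python
from itertools import permutations


def bond_table(bonds):
    """Map each undirected bond {u, v} to its label."""
    return {frozenset((u, v)): label for u, v, label in bonds}


def labeled_isomorphic(atoms_a, bonds_a, atoms_b, bonds_b):
    """Brute-force isomorphism test that respects node and edge labels."""
    table_a, table_b = bond_table(bonds_a), bond_table(bonds_b)
    # Cheap rejections: the multisets of atom and bond labels must agree.
    if sorted(atoms_a.values()) != sorted(atoms_b.values()):
        return False
    if sorted(table_a.values()) != sorted(table_b.values()):
        return False
    nodes_a = list(atoms_a)
    for perm in permutations(atoms_b):
        m = dict(zip(nodes_a, perm))
        # Atom types must match under the mapping.
        if any(atoms_a[u] != atoms_b[m[u]] for u in nodes_a):
            continue
        # Every bond of A must land on a bond of B with the same label.
        if all(table_b.get(frozenset(m[u] for u in bond)) == label
               for bond, label in table_a.items()):
            return True
    return False


# Methane with the carbon numbered 1 (hypothetical numbering).
methane_a_atoms = {1: "C", 2: "H", 3: "H", 4: "H", 5: "H"}
methane_a_bonds = [(1, 2, "S"), (1, 3, "S"), (1, 4, "S"), (1, 5, "S")]

# The same molecule with the carbon numbered 3.
methane_b_atoms = {1: "H", 2: "H", 3: "C", 4: "H", 5: "H"}
methane_b_bonds = [(3, 1, "S"), (3, 2, "S"), (3, 4, "S"), (3, 5, "S")]

# Same atoms and bond labels, but one H is attached to another H:
# a deliberately different graph, not a real species.
other_atoms = {1: "C", 2: "H", 3: "H", 4: "H", 5: "H"}
other_bonds = [(1, 2, "S"), (1, 3, "S"), (1, 4, "S"), (2, 5, "S")]

same_species = labeled_isomorphic(methane_a_atoms, methane_a_bonds,
                                  methane_b_atoms, methane_b_bonds)
different_species = labeled_isomorphic(methane_a_atoms, methane_a_bonds,
                                       other_atoms, other_bonds)
```

The two methane numberings compare as the same species, while the third graph passes the cheap label-count checks and is only rejected by the full search.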
9. RMG classifies species as one of three types
- Core species make up all of the reactants of the reaction mechanism
- Edge species are products of the reaction mechanism not included in the core; they may be added to the core over the course of the algorithm
- Postulated species are proposed species that may be added to the edge over the course of the algorithm
10. RMG algorithm manipulates graphs to generate a reaction mechanism
- Initialize set of core species.
- Generate postulated species using some rules.
- Use GI to discard postulated species based on various criteria.
- Add remaining postulated species to edge species.
- Determine if any edge species should be added to core.
- Are termination criteria met? No: return to generating postulated species. Yes: stop.
11. Checking for duplicate graphs using GI looks parallelizable
- For example, could scatter postulated species over all processors and check for duplicates against core species in parallel
- Could also do this with forbidden configs, etc.
Use GI to discard postulated species based on various criteria:
- Use GI to check for forbidden configurations.
- Use GI to check for duplicates among postulated species.
- Use GI to check that postulated species aren't duplicated in core.
- Discard any duplicates.
12. Instead of working with RMG directly, I created a prototype
- RMG takes 18 mos. for a developer to get up to speed; this project was 6 wks.
- To save time, I built a prototype in Python because its syntax and available libraries enable rapid development
- Also enabled me to focus on the parts of the code that matter (GI algorithms) and ignore the rest
13. Serial prototype throws out everything but GI checking
- Initialize set of core species.
- Select postulated species from existing RMG output.
- Use GI to discard postulated species based on various criteria.
- Add remaining postulated species to core species.
- Are termination criteria met? No: return to selecting postulated species. Yes: stop.
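The loop on this slide can be sketched in a few lines. The version below is an illustration under stated assumptions, not the prototype's code: graphs are unlabeled and stored as `(n, edges)` tuples, `isomorphic` is a brute-force stand-in for a library GI routine, and `batch_size` and the toy graphs are hypothetical. The cap of 100 iterations follows the termination rule given later in the talk.

```python
from itertools import permutations


def make_graph(n, edge_list):
    """Graph on nodes 0..n-1 with undirected edges (illustrative representation)."""
    return (n, frozenset(frozenset(e) for e in edge_list))


def isomorphic(g, h):
    """Brute-force GI test; a stand-in for a library routine."""
    (n, edges_g), (m, edges_h) = g, h
    if n != m or len(edges_g) != len(edges_h):
        return False
    for perm in permutations(range(n)):
        if frozenset(frozenset(perm[u] for u in e) for e in edges_g) == edges_h:
            return True
    return False


def discard_duplicates(postulated, core):
    """Keep postulated graphs that duplicate neither each other nor the core."""
    unique = []
    for g in postulated:
        if any(isomorphic(g, h) for h in unique):
            continue  # duplicate among postulated species
        if any(isomorphic(g, h) for h in core):
            continue  # already present in core
        unique.append(g)
    return unique


def run_serial(core, pending, batch_size=2, max_iterations=100):
    """Move species from `pending` into `core` until none remain or the cap is hit."""
    core, pending = list(core), list(pending)
    iterations = 0
    while pending and iterations < max_iterations:
        postulated, pending = pending[:batch_size], pending[batch_size:]
        core.extend(discard_duplicates(postulated, core))
        iterations += 1
    return core, iterations


# Toy "database" with relabeled duplicates (hypothetical data).
path3 = make_graph(3, [(0, 1), (1, 2)])
path3_relabeled = make_graph(3, [(0, 2), (2, 1)])
triangle = make_graph(3, [(0, 1), (1, 2), (0, 2)])
star4 = make_graph(4, [(0, 1), (0, 2), (0, 3)])
star4_relabeled = make_graph(4, [(3, 0), (3, 1), (3, 2)])

final_core, n_iterations = run_serial(
    core=[path3],
    pending=[path3_relabeled, triangle, star4, star4_relabeled],
)
```

In this toy run the relabeled path is rejected against the core and the relabeled star is rejected against its own batch, so the core ends with three distinct graphs after two iterations.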
14. Parallel prototype parallelizes part of the GI comparisons
- Checking postulated species against core species is embarrassingly parallel
- Postulated species are essentially independent in that step
Use GI to discard postulated species based on various criteria (in prototype):
- Use GI to check for duplicates among postulated species.
- Use GI in parallel to check that postulated species aren't duplicated in core.
- Discard any duplicates.
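The scatter, check, and gather pattern for the core comparison might look like the sketch below. It deliberately swaps in a thread pool from the Python standard library as a stand-in for MPI ranks, so it shows only the decomposition: each worker receives a chunk of postulated graphs plus the full core and returns the graphs not found in the core. Threads in CPython will not give a real speedup for this pure-Python, CPU-bound check. The helper names, the graph representation, and the toy data are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import permutations


def make_graph(n, edge_list):
    """Graph on nodes 0..n-1 with undirected edges (illustrative representation)."""
    return (n, frozenset(frozenset(e) for e in edge_list))


def isomorphic(g, h):
    """Brute-force GI test; a stand-in for a library routine."""
    (n, edges_g), (m, edges_h) = g, h
    if n != m or len(edges_g) != len(edges_h):
        return False
    return any(
        frozenset(frozenset(perm[u] for u in e) for e in edges_g) == edges_h
        for perm in permutations(range(n))
    )


def split(items, n_parts):
    """Contiguous, nearly equal chunks: the 'scatter' step."""
    size, extra = divmod(len(items), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def not_in_core(chunk, core):
    """Work done by one worker: drop graphs that duplicate a core graph."""
    return [g for g in chunk if not any(isomorphic(g, h) for h in core)]


def parallel_filter(postulated, core, n_workers=4):
    """Scatter chunks to workers, check each against core, gather survivors in order."""
    chunks = split(postulated, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(not_in_core, chunks, [core] * n_workers))
    return [g for part in parts for g in part]


# Hypothetical toy data.
path3 = make_graph(3, [(0, 1), (1, 2)])
path3_relabeled = make_graph(3, [(0, 2), (2, 1)])
triangle = make_graph(3, [(0, 1), (1, 2), (0, 2)])
star4 = make_graph(4, [(0, 1), (0, 2), (0, 3)])
path4 = make_graph(4, [(0, 1), (1, 2), (2, 3)])

core = [path3, star4]
postulated = [path3_relabeled, triangle, make_graph(4, [(3, 0), (3, 1), (3, 2)]), path4]

survivors = parallel_filter(postulated, core, n_workers=3)
serial_survivors = not_in_core(postulated, core)
```

Because each postulated graph is compared only against the shared, read-only core, the chunks need no communication with each other, which is what makes this step embarrassingly parallel.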
15. Prototypes were implemented in Python/MPI on a cluster
- Software
  - Python 2.5 (w/ C extensions)
  - igraph module (graph data structure, GI algorithms)
  - mpi4py module (MPI bindings for Python)
- Hardware
  - 64-node cluster (pharos.mit.edu)
  - 8 GB RAM per node
  - Each node has 2 quad-core Xeon processors (either 2.33 GHz or 2.66 GHz)
16. Parallel prototype was faster on intermediate-sized problems
- Species database was obtained from existing RMG output
- Initial set of core species was 50% of database, randomly chosen
- Program ran until all species in database were moved into core, or it reached 100 iterations
17. Communication is slow in large test cases due to passing graph objects
- Graphs are implemented using a class in the igraph library
- mpi4py converts non-native Python objects using cPickle, which is compute-intensive
- cPickle is probably why the serial code is faster in large test cases
- An alternative approach that uses NumPy and defines an MPI derived data type would be faster
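One way to picture the alternative is to flatten each graph into a contiguous block of integers before sending it, so that no general-purpose object serialization is needed. In mpi4py, the lowercase methods such as `comm.send` pickle arbitrary Python objects, while the uppercase methods such as `comm.Send` transmit buffer-like objects directly. The sketch below uses the standard-library `array` module as a stand-in for a NumPy array and involves no MPI calls; the layout `[n_nodes, n_edges, u0, v0, u1, v1, ...]` and the function names are assumptions.

```python
from array import array


def encode(n_nodes, edges):
    """Pack a graph into a flat integer array: [n_nodes, n_edges, u0, v0, u1, v1, ...]."""
    flat = array("i", [n_nodes, len(edges)])
    for u, v in edges:
        flat.extend((u, v))
    return flat


def decode(flat):
    """Rebuild (n_nodes, edges) from the flat layout produced by encode()."""
    n_nodes, n_edges = flat[0], flat[1]
    edges = [(flat[2 + 2 * k], flat[3 + 2 * k]) for k in range(n_edges)]
    return n_nodes, edges


# Hypothetical five-node graph.
edges = [(0, 1), (0, 2), (0, 3), (0, 4)]
flat = encode(5, edges)

# Raw bytes as they might be handed to a buffer-based send.
payload = flat.tobytes()

# Receiver side: reinterpret the bytes as integers and rebuild the graph.
received = array("i")
received.frombytes(payload)
restored = decode(received)
```

Node and edge labels could be appended to the same array as integer codes; whether this is actually faster than pickling for RMG-sized graphs would have to be measured.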
18. Many technical problems occurred during the project
- Laptop experienced hardware failures
- Difficulties installing igraph and mpi4py on pharos
  - System libraries had to be recompiled
  - Environment variables were reset so igraph and mpi4py could be recognized on all nodes
- Incomplete mpi4py documentation
- Python extended debugger not installed; no graphical front-end
19. Parallelism can be used to speed up RMG for some case studies
- Saw speedup for intermediate-sized case studies on parallel prototype
- Additional opportunities for parallelism within RMG algorithm
- Can also decrease MPI communication costs w/ additional development, use of debugger/profiler
20. Future Work
- Install extended Python debugger/profiler
- Use NumPy and MPI derived data type to reduce communication overhead
- Try alternative strategies for parallelization
  - Reorganize algorithm (check core species, then postulated species)
  - Parallelize checks of postulated species against themselves
21. Acknowledgments
- RMG team
- Franklin Goldsmith
- Sandeep Sharma
- Josh Allen
- Richard West
- Michael Harper
- Greg Magoon
- Ray Speth
- Kushal Kedia
- Prof. Bill Green
- DOE CSGF for funding