Parallelizing the graph isomorphism portion of an automatic reaction mechanism generation algorithm

1
Parallelizing the graph isomorphism portion of an
automatic reaction mechanism generation algorithm

Geoff Oxberry
18.337 Project, Spring 2009

2
Automatic reaction mechanism generation yields
models quickly

Reaction mechanisms are used to model chemistry
in a wide range of applications
Generating first principles reaction mechanisms
can take years, requires lots of expertise
The Bill Green group developed software (RMG)
that automatically generates these models based
on rules

3
For some problems, RMG takes days to generate a
mechanism

We want it to take a day or less on a cluster
Big bottleneck for us is that we have to
repeatedly solve a colored graph isomorphism (GI)
problem
If we can speed it up, we can solve many more
interesting chemistry problems
Parallelism is one option

4
We want to see if parallelism can be used to
speed up RMG

Want to see if a parallel version of RMG is
faster than a serial version
Due to time constraints, I chose to implement
skeletal prototypes of serial and parallel
versions of RMG in Python
Idea is to use results for the prototypes to see
if it is worth parallelizing the production-scale
code

5
Parallelism does speed up RMG on
intermediate-sized case studies

When searching for graph isomorphisms in
collections of 20 or fewer graphs, serial code is
faster
When searching for graph isomorphisms in
collections of 100 graphs, parallel code is
faster
When searching for graph isomorphisms in
collections of 2000 graphs, serial code is
faster again

6
Outline

Brief overview of graph isomorphism
Discussion of existing RMG algorithm and how to
parallelize
Python prototypes of serial and parallel versions
of RMG algorithm
Results
Discussion of obstacles
Conclusions

7
Two graphs are isomorphic if there exists an
edge-preserving bijection between their nodes

These two graphs are isomorphic

[Figure: two five-node graphs with nodes numbered 1-5]

Bijection here (left graph to right graph): 1-1, 3-4, 2-5, 5-2, 4-3
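
The check behind this slide can be sketched in a few lines of Python. The edges of the original figure did not survive, so both edge lists below are hypothetical; only the node bijection is taken from the slide. The sketch confirms that the mapping carries every edge of the left graph onto an edge of the right graph.

```python
# Hypothetical edge lists: the figure's edges were lost, so these are
# assumptions chosen so that the slide's bijection works.
left_edges = {(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)}
right_edges = {(1, 4), (1, 5), (4, 5), (3, 4), (2, 3)}

# Bijection from the slide (left node -> right node).
bijection = {1: 1, 3: 4, 2: 5, 5: 2, 4: 3}


def normalize(edges):
    """Treat edges as undirected by storing each one as a frozenset."""
    return {frozenset(edge) for edge in edges}


def is_isomorphism(edges_a, edges_b, mapping):
    """True if mapping is one-to-one and carries edges_a exactly onto edges_b."""
    one_to_one = len(set(mapping.values())) == len(mapping)
    mapped = {(mapping[u], mapping[v]) for u, v in edges_a}
    return one_to_one and normalize(mapped) == normalize(edges_b)


print(is_isomorphism(left_edges, right_edges, bijection))  # prints True
```

The identity mapping fails the same check for these edge lists, which shows that an arbitrary bijection between nodes is not enough: adjacency has to be preserved.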

8
In RMG, ChemGraphs represent species

ChemGraphs are graphs with node labels and edge
labels
Species are represented by a class of graphs
equivalent under isomorphism
Example (methane)

[Figure: methane ChemGraph drawings with nodes numbered 1-5]
Node labels refer to atom types, edge labels
refer to bond types

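
A minimal sketch of a labeled (colored) isomorphism test, using methane under two different node numberings. The dictionary layout, the label strings, and the function name `labeled_isomorphic` are assumptions for illustration and are not RMG's data structures. The brute-force search over all permutations is only practical for tiny graphs.

```python
from itertools import permutations


def labeled_isomorphic(nodes_a, edges_a, nodes_b, edges_b):
    """Brute-force search for a bijection preserving node and edge labels.

    nodes_*: dict mapping node id -> atom label
    edges_*: dict mapping frozenset({u, v}) -> bond label
    """
    if len(nodes_a) != len(nodes_b) or len(edges_a) != len(edges_b):
        return False
    ids_a = list(nodes_a)
    for perm in permutations(nodes_b):
        mapping = dict(zip(ids_a, perm))
        # Atom types must agree under the mapping.
        if any(nodes_a[n] != nodes_b[mapping[n]] for n in ids_a):
            continue
        # Every bond must land on a bond of the same type.
        mapped = {frozenset(mapping[n] for n in edge): label
                  for edge, label in edges_a.items()}
        if mapped == edges_b:
            return True
    return False


# Methane with the carbon as node 1.
nodes_a = {1: "C", 2: "H", 3: "H", 4: "H", 5: "H"}
edges_a = {frozenset({1, k}): "single" for k in (2, 3, 4, 5)}

# Methane again, numbered so that the carbon is node 3.
nodes_b = {1: "H", 2: "H", 3: "C", 4: "H", 5: "H"}
edges_b = {frozenset({3, k}): "single" for k in (1, 2, 4, 5)}

print(labeled_isomorphic(nodes_a, edges_a, nodes_b, edges_b))  # prints True
```

Changing a single bond label in one of the graphs makes the test fail, even though the unlabeled skeletons are still isomorphic.
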
9
RMG classifies species as one of three types

Core species make up all of the reactants of the
reaction mechanism
Edge species are products of the reaction
mechanism not included in the core; they may be
added to the core over the course of the
algorithm
Postulated species are proposed species that may
be added to the edge over the course of the
algorithm

10
RMG algorithm manipulates graphs to generate a
reaction mechanism

[Flowchart]
Initialize set of core species.
Generate postulated species using some rules.
Use GI to discard postulated species based on various criteria.
Add remaining postulated species to edge species.
Determine if any edge species should be added to core.
Are termination criteria met? If no, generate more postulated species; if yes, stop.

11
Checking for duplicate graphs using GI looks
parallelizable

For example, could scatter postulated species
over all processors and check for duplicates
against core species in parallel
Could also do this with forbidden configs, etc.

Use GI to discard postulated species based on
various criteria:
Use GI to check for forbidden configurations.
Use GI to check for duplicates among postulated
species.
Use GI to check that postulated species aren't
duplicated in core.
Discard any duplicates.

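
The three checks above can be sketched as one filtering pass. The function name and both predicates (`same_species`, `is_forbidden`) are hypothetical placeholders for the GI calls, not RMG code. The toy example represents species as strings and compares sorted characters, which is only a stand-in for a real isomorphism test.

```python
def discard_postulated(postulated, core, forbidden, same_species, is_forbidden):
    """Return the postulated species that survive the three GI-based checks.

    same_species(a, b): hypothetical predicate, True if a and b are isomorphic.
    is_forbidden(s, pattern): hypothetical predicate for forbidden configurations.
    """
    survivors = []
    for species in postulated:
        # 1. Drop species that match a forbidden configuration.
        if any(is_forbidden(species, pattern) for pattern in forbidden):
            continue
        # 2. Drop duplicates of postulated species already kept.
        if any(same_species(species, kept) for kept in survivors):
            continue
        # 3. Drop species that already exist in the core.
        if any(same_species(species, member) for member in core):
            continue
        survivors.append(species)
    return survivors


# Toy stand-ins (assumptions): strings instead of graphs.
def toy_same(a, b):
    return sorted(a) == sorted(b)


def toy_forbidden(species, pattern):
    return pattern in species


result = discard_postulated(
    postulated=["CHO", "OHC", "CC", "CHH", "XCH"],
    core=["HHC"],
    forbidden=["X"],
    same_species=toy_same,
    is_forbidden=toy_forbidden,
)
print(result)  # prints ['CHO', 'CC']
```

Step 3 compares each species only against the core and not against other postulated species, which is why it is the natural candidate for splitting across processors.
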
12
Instead of working with RMG directly, I created a
prototype

RMG takes 18 months for a developer to get up to
speed; this project lasted 6 weeks
To save time, I built a prototype in Python
because its syntax and available libraries enable
rapid development
Also enabled me to focus on the parts of the code
that matter (GI algorithms) and ignore the rest

13
Serial prototype throws out everything but GI
checking

[Flowchart]
Initialize set of core species.
Select postulated species from existing RMG output.
Use GI to discard postulated species based on various criteria.
Add remaining postulated species to core species.
Are termination criteria met? If no, select more postulated species; if yes, stop.

14
Parallel prototype parallelizes part of the GI
comparisons

Checking postulated species against core species
is embarrassingly parallel
Postulated species are essentially independent in
that step

Use GI to discard postulated species based on
various criteria (in prototype):
Use GI to check for duplicates among postulated
species.
Use GI in parallel to check that postulated
species aren't duplicated in core.
Discard any duplicates.

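
A minimal sketch of the parallel step, assuming a scatter/gather layout with mpi4py. The helper names and the `same_species` predicate are hypothetical, and the presentation does not show how the prototype actually distributed work. The lowercase `bcast`, `scatter`, and `gather` methods are the mpi4py calls that pickle arbitrary Python objects.

```python
def split_chunks(items, n_chunks):
    """Split items into n_chunks nearly equal lists, one per MPI rank."""
    return [items[i::n_chunks] for i in range(n_chunks)]


def not_in_core(chunk, core, same_species):
    """Keep the species in chunk that match no core species."""
    return [s for s in chunk if not any(same_species(s, c) for c in core)]


def parallel_core_check(postulated, core, same_species):
    """Scatter postulated species, check each chunk against the core, gather.

    Must be called by every rank; only rank 0 needs real input data and
    only rank 0 receives the combined result.
    """
    from mpi4py import MPI  # imported here so the helpers run without MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    chunks = split_chunks(postulated, size) if rank == 0 else None
    core = comm.bcast(core, root=0)          # every rank needs the full core
    my_chunk = comm.scatter(chunks, root=0)  # pickles each chunk
    kept = not_in_core(my_chunk, core, same_species)
    gathered = comm.gather(kept, root=0)
    if rank == 0:
        return [s for part in gathered for s in part]
    return None
```

To try it, a script would call `parallel_core_check` on every rank and be launched with something like `mpirun -n 4 python script.py`. The strided split does not preserve the original ordering of species in the gathered result.
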
15
Prototypes were implemented in Python/MPI on a
cluster

Software
Python 2.5 (w/ C extensions)
igraph module (graph data structure, GI
algorithms)
mpi4py module (MPI bindings for Python)

Hardware
64-node cluster (pharos.mit.edu)
8 GB RAM per node
Each node has 2 quad-core Xeon processors (either
2.33 GHz or 2.66 GHz)

16
Parallel prototype was faster on
intermediate-sized problems

Species database was obtained from existing RMG
output
Initial set of core species was 50 of database,
randomly chosen
Program ran until all species in database were
moved into core, or it reached 100 iterations
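
The test procedure can be sketched as a driver loop. The slide does not say how many species were postulated per iteration or how they were chosen, so `batch_size`, `initial_core_size`, random selection, and the function name are all assumptions; only the two stopping conditions come from the slide.

```python
import random


def run_prototype(database, initial_core_size, batch_size, same_species,
                  max_iterations=100, seed=0):
    """Move species from a database into the core until none are left
    or the iteration limit is reached. Returns (core, iterations)."""
    rng = random.Random(seed)
    # Randomly chosen initial core (its size is a parameter here).
    core = rng.sample(database, initial_core_size)
    remaining = [s for s in database if s not in core]
    iterations = 0
    while remaining and iterations < max_iterations:
        iterations += 1
        # Assumed selection rule: draw a random batch of postulated species.
        postulated = rng.sample(remaining, min(batch_size, len(remaining)))
        # Stand-in for the GI step: keep species not already in the core.
        new = [s for s in postulated
               if not any(same_species(s, c) for c in core)]
        core.extend(new)
        remaining = [s for s in remaining if s not in postulated]
    return core, iterations


# Toy run with integers standing in for species graphs.
core, iterations = run_prototype(list(range(10)), 5, 2, lambda a, b: a == b)
print(len(core), iterations)  # prints 10 3
```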

17
Communication is slow in large test cases due to
passing graph objects

Graphs are implemented using a class in the
igraph library
mpi4py converts non-native Python objects using
cPickle, which is compute-intensive
cPickle is probably why the serial code is faster
in large test cases
An alternative approach that uses NumPy and defines
an MPI derived data type would be faster
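
One way to picture the alternative is to flatten each graph into a contiguous integer buffer before sending it, so that no general-purpose object serialization is needed. The sketch below uses the standard-library `array` module as a stand-in for NumPy, and the buffer layout, integer label codes, and function names are assumptions rather than the author's design.

```python
from array import array


def encode_graph(node_labels, edges):
    """Pack a labeled graph into one flat array of C ints.

    Layout: [n_nodes, n_edges, node labels..., (u, v, bond) triples...]
    Labels are assumed to be small integer codes.
    """
    flat = array("i", [len(node_labels), len(edges)])
    flat.extend(node_labels)
    for u, v, bond in edges:
        flat.extend((u, v, bond))
    return flat


def decode_graph(flat):
    """Invert encode_graph, returning (node_labels, edges)."""
    n_nodes, n_edges = flat[0], flat[1]
    node_labels = list(flat[2:2 + n_nodes])
    body = flat[2 + n_nodes:]
    edges = [tuple(body[3 * k:3 * k + 3]) for k in range(n_edges)]
    return node_labels, edges


# Methane with atomic numbers as node labels and 1 meaning a single bond.
labels = [6, 1, 1, 1, 1]
bonds = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1)]
buffer = encode_graph(labels, bonds)
print(decode_graph(buffer) == (labels, bonds))  # prints True
```

A flat buffer like this is the kind of object that mpi4py's buffer-based (uppercase) communication methods are meant for, though whether it beats pickling in practice would have to be measured.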

18
Many technical problems occurred during the
project

Laptop experienced hardware failures
Difficulties installing igraph and mpi4py on
pharos
System libraries had to be recompiled
Environment variables were reset so igraph and
mpi4py could be recognized on all nodes
Incomplete mpi4py documentation
Python extended debugger not installed; no
graphical front-end

19
Parallelism can be used to speed up RMG for some
case studies

Saw speedup for intermediate-sized case studies
on parallel prototype
Additional opportunities for parallelism within
RMG algorithm
Can also decrease MPI communication costs w/
additional development, use of debugger/profiler

20
Future Work