Parallelizing the graph isomorphism portion of an automatic reaction mechanism generation algorithm


1
Parallelizing the graph isomorphism portion of an
automatic reaction mechanism generation algorithm
  • Geoff Oxberry
  • 18.337 Project, Spring 2009

2
Automatic reaction mechanism generation yields
models quickly
  • Reaction mechanisms are used to model chemistry
    in a wide range of applications
  • Generating first principles reaction mechanisms
    can take years, requires lots of expertise
  • The Bill Green group developed software (RMG)
    that automatically generates these models based
    on rules

3
For some problems, RMG takes days to generate a
mechanism
  • We want it to take a day or less on a cluster
  • Big bottleneck for us is that we have to
    repeatedly solve a colored graph isomorphism (GI)
    problem
  • If we can speed it up, we can solve many more
    interesting chemistry problems
  • Parallelism is one option

4
We want to see if parallelism can be used to
speed up RMG
  • Want to see if a parallel version of RMG is
    faster than a serial version
  • Due to time constraints, I chose to implement
    skeletal prototypes of serial and parallel
    versions of RMG in Python
  • Idea is to use results from the prototypes to
    see if it is worth parallelizing the
    production-scale code

5
Parallelism does speed up RMG on
intermediate-sized case studies
  • When searching for graph isomorphisms in
    collections of 20 or fewer graphs, serial code is
    faster
  • When searching for graph isomorphisms in
    collections of 100 graphs, parallel code is
    faster
  • When searching for graph isomorphisms in
    collections of 2000 graphs, serial code is
    faster again

6
Outline
  • Brief overview of graph isomorphism
  • Discussion of existing RMG algorithm and how to
    parallelize
  • Python prototypes of serial and parallel versions
    of RMG algorithm
  • Results
  • Discussion of obstacles
  • Conclusions

7
Two graphs are isomorphic if there exists an
edge-preserving bijection between their nodes
  • These two graphs are isomorphic

[Figure: two isomorphic graphs, each with nodes numbered 1 to 5]
  • Bijection here (left to right): 1→1, 3→4, 2→5, 5→2, 4→3
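
A minimal sketch of what "edge-preserving bijection" means in code. The edge lists below are hypothetical, since the edges in the slide's figure are not recoverable; only the node mapping is taken from the slide. The helper name `is_isomorphism` is also an assumption, and the sketch assumes undirected graphs with no isolated nodes.

```python
def is_isomorphism(edges_a, edges_b, mapping):
    """Return True if `mapping` is an edge-preserving bijection from A to B.

    Graphs are undirected edge lists; nodes are inferred from the edges,
    so isolated nodes are not supported in this sketch.
    """
    nodes_a = {n for edge in edges_a for n in edge}
    nodes_b = {n for edge in edges_b for n in edge}

    # The mapping must cover every node of A and hit every node of B once.
    if set(mapping) != nodes_a or set(mapping.values()) != nodes_b:
        return False
    if len(set(mapping.values())) != len(mapping):
        return False

    # Relabel A's edges through the mapping and compare with B's edges.
    image = {frozenset((mapping[u], mapping[v])) for u, v in edges_a}
    return image == {frozenset(edge) for edge in edges_b}


# Hypothetical left-hand graph: a 5-cycle plus one chord.
LEFT = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 1), (1, 3)]
# Node mapping from the slide (left node -> right node).
SLIDE_BIJECTION = {1: 1, 3: 4, 2: 5, 5: 2, 4: 3}
# Hypothetical right-hand graph, built so that the slide's mapping works.
RIGHT = [(1, 5), (5, 4), (4, 3), (3, 2), (2, 1), (1, 4)]

print(is_isomorphism(LEFT, RIGHT, SLIDE_BIJECTION))
print(is_isomorphism(LEFT, RIGHT, {n: n for n in range(1, 6)}))
```

The identity mapping is rejected because the chord 1-3 in the left graph has no counterpart 1-3 in the right graph. A bijection on nodes alone is not enough.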

8
In RMG, ChemGraphs represent species
  • ChemGraphs are graphs with node labels and edge
    labels
  • Species are represented by a class of graphs
    equivalent under isomorphism
  • Example (methane)

[Figure: two labeled graphs of methane with nodes numbered 1 to 5.
Node labels refer to atom types, edge labels refer to bond types]

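A rough sketch of a labeled graph and a brute-force colored isomorphism test, using methane as in the slide. This is not RMG's ChemGraph class or its GI algorithm. The class layout, the integer bond labels, and the function name `find_isomorphism` are assumptions for illustration, and trying every permutation is only practical for very small graphs.

```python
from itertools import permutations


class ChemGraph:
    """Toy labeled graph: atom types on nodes, bond types on edges."""

    def __init__(self, atoms, bonds):
        self.atoms = dict(atoms)  # node id -> atom type, e.g. "C"
        # Undirected bonds: frozenset({u, v}) -> bond type (1 = single here)
        self.bonds = {frozenset(pair): kind for pair, kind in bonds.items()}


def find_isomorphism(g, h):
    """Return a label-preserving node mapping from g to h, or None."""
    # Cheap rejections before trying any permutation.
    if sorted(g.atoms.values()) != sorted(h.atoms.values()):
        return None
    if sorted(g.bonds.values()) != sorted(h.bonds.values()):
        return None

    g_nodes = list(g.atoms)
    for perm in permutations(h.atoms):
        mapping = dict(zip(g_nodes, perm))
        # Node labels (atom types) must match under the mapping.
        if any(g.atoms[n] != h.atoms[mapping[n]] for n in g_nodes):
            continue
        # Edge labels (bond types) must match under the mapping.
        mapped = {frozenset(mapping[n] for n in pair): kind
                  for pair, kind in g.bonds.items()}
        if mapped == h.bonds:
            return mapping
    return None


# Methane with the carbon numbered 1, and again with the carbon numbered 3.
methane_a = ChemGraph({1: "C", 2: "H", 3: "H", 4: "H", 5: "H"},
                      {(1, 2): 1, (1, 3): 1, (1, 4): 1, (1, 5): 1})
methane_b = ChemGraph({1: "H", 2: "H", 3: "C", 4: "H", 5: "H"},
                      {(3, 1): 1, (3, 2): 1, (3, 4): 1, (3, 5): 1})
# Same atoms and bond count, different connectivity (not a real molecule).
not_methane = ChemGraph({1: "C", 2: "H", 3: "H", 4: "H", 5: "H"},
                        {(1, 2): 1, (1, 3): 1, (1, 4): 1, (4, 5): 1})

print(find_isomorphism(methane_a, methane_b))
print(find_isomorphism(methane_a, not_methane))
```

Both numberings of methane belong to the same isomorphism class, so they represent the same species; the third graph does not.
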
9
RMG classifies species as one of three types
  • Core species make up all of the reactants of the
    reaction mechanism
  • Edge species are products of the reaction
    mechanism not included in the core; they may be
    added to the core over the course of the
    algorithm
  • Postulated species are proposed species that may
    be added to the edge over the course of the
    algorithm

10
RMG algorithm manipulates graphs to generate a
reaction mechanism
  • Initialize set of core species
  • Generate postulated species using some rules
  • Use GI to discard postulated species based on
    various criteria
  • Add remaining postulated species to edge species
  • Determine if any edge species should be added to
    core
  • Are termination criteria met? If no, generate
    more postulated species; if yes, stop

11
Checking for duplicate graphs using GI looks
parallelizable
  • For example, could scatter postulated species
    over all processors and check for duplicates
    against core species in parallel
  • Could also do this with forbidden configs, etc.

Use GI to discard postulated species based on
various criteria:
  • Use GI to check for forbidden configurations
  • Use GI to check for duplicates among postulated
    species
  • Use GI to check that postulated species aren't
    duplicated in core
  • Discard any duplicates

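The three checks above can be sketched as a single filter. This is an illustration, not RMG's code: the function name `discard_postulated` is hypothetical, the GI test is passed in as `is_iso`, and forbidden configurations are treated as whole-graph matches because the slide describes them as a GI check. The toy `is_iso` below compares strings as anagrams and only stands in for a real graph comparison.

```python
def discard_postulated(postulated, core, forbidden, is_iso):
    """Return the postulated species that survive all three GI checks."""
    survivors = []
    for candidate in postulated:
        # 1. Forbidden configurations.
        if any(is_iso(candidate, f) for f in forbidden):
            continue
        # 2. Duplicates among the postulated species kept so far.
        if any(is_iso(candidate, s) for s in survivors):
            continue
        # 3. Duplicates of species already in the core.
        if any(is_iso(candidate, c) for c in core):
            continue
        survivors.append(candidate)
    return survivors


def toy_iso(a, b):
    """Stand-in for GI: two strings 'match' if they are anagrams."""
    return sorted(a) == sorted(b)


result = discard_postulated(
    postulated=["HCO", "OCH", "CH4", "OO", "CO"],
    core=["H4C"],
    forbidden=["OO"],
    is_iso=toy_iso,
)
print(result)  # ['HCO', 'CO']
```

Each candidate is compared against every core species, so this check grows with the size of the core. It is also the step where candidates do not depend on each other.
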
12
Instead of working with RMG directly, I created a
prototype
  • RMG takes 18 mos. for a developer to get up to
    speed; this project was 6 wks.
  • To save time, I built a prototype in Python
    because its syntax and available libraries enable
    rapid development
  • Also enabled me to focus on the parts of the code
    that matter (GI algorithms) and ignore the rest

13
Serial prototype throws out everything but GI
checking
  • Initialize set of core species
  • Select postulated species from existing RMG
    output
  • Use GI to discard postulated species based on
    various criteria
  • Add remaining postulated species to core species
  • Are termination criteria met? If no, select more
    postulated species; if yes, stop

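A minimal sketch of the serial loop, assuming species are drawn from a fixed database in batches. The function name, the batch size, and the initial core fraction of one half are assumptions; the cap of 100 iterations and the "stop when everything is in the core" rule follow the test setup described later in the talk. Plain equality stands in for the GI test.

```python
import random


def run_serial_prototype(database, is_iso, initial_fraction=0.5,
                         batch_size=10, max_iterations=100, seed=0):
    """Grow a core set from a species database; return (core, iterations)."""
    rng = random.Random(seed)
    shuffled = list(database)
    rng.shuffle(shuffled)

    # Randomly chosen initial core; the rest waits to be postulated.
    n_core = int(len(shuffled) * initial_fraction)
    core, remaining = shuffled[:n_core], shuffled[n_core:]

    iterations = 0
    while remaining and iterations < max_iterations:
        # "Select postulated species": take the next batch (hypothetical).
        postulated, remaining = remaining[:batch_size], remaining[batch_size:]

        # Discard duplicates among the batch and against the core.
        fresh = []
        for candidate in postulated:
            if any(is_iso(candidate, s) for s in fresh):
                continue
            if any(is_iso(candidate, s) for s in core):
                continue
            fresh.append(candidate)

        core.extend(fresh)  # the prototype has no edge set
        iterations += 1
    return core, iterations


core, iterations = run_serial_prototype(
    database=range(40), is_iso=lambda a, b: a == b)
print(len(core), iterations)  # 40 2
```
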
14
Parallel prototype parallelizes part of the GI
comparisons
  • Checking postulated species against core species
    is embarrassingly parallel
  • Postulated species are essentially independent in
    that step

Use GI to discard postulated species based on
various criteria (in prototype):
  • Use GI to check for duplicates among postulated
    species
  • Use GI in parallel to check that postulated
    species aren't duplicated in core
  • Discard any duplicates

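The scatter/check/gather pattern can be sketched without MPI by simulating the ranks in a loop. This is not the prototype's code: the function names and the round-robin split are assumptions. In an mpi4py version, `comm.scatter` and `comm.gather` would take the place of the list operations marked below, and each rank would run `local_check` on its own chunk.

```python
def split_round_robin(items, n_ranks):
    """Deal items out to n_ranks chunks, one chunk per (simulated) rank."""
    return [items[rank::n_ranks] for rank in range(n_ranks)]


def local_check(chunk, core, is_iso):
    """Work done by one rank: keep candidates not duplicated in the core."""
    return [cand for cand in chunk
            if not any(is_iso(cand, species) for species in core)]


def parallel_core_check(postulated, core, is_iso, n_ranks=4):
    """Simulate scatter -> independent local checks -> gather."""
    chunks = split_round_robin(postulated, n_ranks)          # root: scatter
    partial = [local_check(chunk, core, is_iso)              # every rank
               for chunk in chunks]
    return [cand for part in partial for cand in part]       # root: gather


survivors = parallel_core_check(
    postulated=list(range(10)), core=[2, 3, 7],
    is_iso=lambda a, b: a == b)
print(sorted(survivors))  # [0, 1, 4, 5, 6, 8, 9]
```

No rank needs another rank's result, which is what makes this step embarrassingly parallel. The cost that the simulation hides is sending the graph objects to each rank and collecting them again.
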
15
Prototypes were implemented in Python/MPI on a
cluster
  • Software
      • Python 2.5 (w/ C extensions)
      • igraph module (graph data structure, GI
        algorithms)
      • mpi4py module (MPI bindings for Python)
  • Hardware
      • 64-node cluster (pharos.mit.edu)
      • 8 GB RAM per node
      • Each node has 2 quad-core Xeon processors
        (either 2.33 GHz or 2.66 GHz)

16
Parallel prototype was faster on
intermediate-sized problems
  • Species database was obtained from existing RMG
    output
  • Initial set of core species was 50% of database,
    randomly chosen
  • Program ran until all species in database were
    moved into core, or it reached 100 iterations

17
Communication is slow in large test cases due to
passing graph objects
  • Graphs are implemented using a class in the
    igraph library
  • mpi4py converts non-native Python objects using
    cPickle, which is compute-intensive
  • cPickle is probably why the serial code is faster
    in large test cases
  • An alternative approach that uses NumPy and
    defines an MPI derived data type would be faster
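
One way to picture the alternative is to flatten each graph into a contiguous block of integers before sending it. The sketch below uses the standard-library `array` module so it runs without extra packages; a NumPy array would play the same role. The layout, the function names, and the use of atomic numbers as atom codes are all assumptions. mpi4py's uppercase `Send` and `Recv` methods work on buffer-like objects of this kind without pickling.

```python
from array import array


def encode_graph(atom_codes, bonds):
    """Pack a labeled graph into one flat integer buffer.

    Layout (hypothetical): [n_atoms, n_bonds, atom codes..., u, v, kind, ...]
    """
    flat = array("i", [len(atom_codes), len(bonds)])
    flat.extend(atom_codes)
    for u, v, kind in bonds:
        flat.extend((u, v, kind))
    return flat


def decode_graph(flat):
    """Inverse of encode_graph: rebuild the atom codes and bond triples."""
    n_atoms, n_bonds = flat[0], flat[1]
    atom_codes = list(flat[2:2 + n_atoms])
    body = flat[2 + n_atoms:]
    bonds = [tuple(body[3 * i:3 * i + 3]) for i in range(n_bonds)]
    return atom_codes, bonds


# Methane: carbon (6) at index 0, four hydrogens (1), four single bonds.
atoms = [6, 1, 1, 1, 1]
bonds = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1)]

buffer = encode_graph(atoms, bonds)
print(len(buffer))  # 19 integers: 2 header + 5 atoms + 4 * 3 bond fields
print(decode_graph(buffer) == (atoms, bonds))  # True
```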

18
Many technical problems occurred during the
project
  • Laptop experienced hardware failures
  • Difficulties installing igraph and mpi4py on
    pharos
  • System libraries had to be recompiled
  • Environment variables were reset so igraph and
    mpi4py could be recognized on all nodes
  • Incomplete mpi4py documentation
  • Python extended debugger not installed; no
    graphical front-end

19
Parallelism can be used to speed up RMG for some
case studies
  • Saw speed up for intermediate-sized case studies
    on parallel prototype
  • Additional opportunities for parallelism within
    RMG algorithm
  • Can also decrease MPI communication costs w/
    additional development, use of debugger/profiler

20
Future Work
  • Install extended Python debugger/profiler
  • Use NumPy and MPI derived data type to reduce
    communication overhead
  • Try alternative strategies for parallelization
  • Reorganize algorithm (check core species, then
    postulated species)
  • Parallelize checks of postulated species against
    themselves

21
Acknowledgments
  • RMG team
  • Franklin Goldsmith
  • Sandeep Sharma
  • Josh Allen
  • Richard West
  • Michael Harper
  • Greg Magoon
  • Ray Speth
  • Kushal Kedia
  • Prof. Bill Green
  • DOE CSGF for funding