Overview of Chemical Informatics and Cyberinfrastructure Collaboratory


Title: Overview of Chemical Informatics and Cyberinfrastructure Collaboratory

Overview of Chemical Informatics and
Cyberinfrastructure Collaboratory
  • October 18 2006
  • Geoffrey Fox
  • Computer Science, Informatics, Physics
  • Pervasive Technology Laboratories
  • Indiana University Bloomington IN 47401
  • gcf_at_indiana.edu
  • http://www.infomall.org
  • http://www.chembiogrid.org

  • Local Teams, successful Prototypes and
    International Collaboration set up in 3 initial
    major focus areas
  • Chemical Informatics Cyberinfrastructure/Grids
    with services, workflows and demonstration uses
    building on success in other applications (LEAD)
    and showing distributed integration of academic
    and commercial tools
  • Computational Chemistry Cyberinfrastructure/Grids
    with simulation, databases and TeraGrid use
  • Education with courses and degrees
  • Review of activities suggests we also formalize
    work in two further areas
  • Chemical Informatics Research: model
    applicability and data-mining
  • Interfacing with the User - interaction tools and
    portal optimized for particular customer groups
  • Also have started an activity to identify
    customers for Cyberinfrastructure and its
    implied Chemistry eScience model

CICC Senior Personnel
  • Peter T. Cherbas
  • Mehmet M. Dalkilic
  • Charles H. Davis
  • A. Keith Dunker
  • Kelsey M. Forsythe
  • Kevin E. Gilbert
  • John C. Huffman
  • Malika Mahoui
  • Daniel J. Mindiola
  • Santiago D. Schnell
  • William Scott
  • Craig A. Stewart
  • David R. Williams
  • Geoffrey C. Fox
  • Mu-Hyun (Mookie) Baik
  • Dennis B. Gannon
  • Marlon Pierce
  • Beth A. Plale
  • Gary D. Wiggins
  • David J. Wild
  • Yuqing (Melanie) Wu

From Biology, Chemistry, Computer Science,
Informatics at IU Bloomington and IUPUI

CICC Infrastructure Vision
  • Drug Discovery and other academic chemistry and
    pharmacology research will be aided by powerful
    modern information technology, with ChemBioGrid
    set up as distributed cyberinfrastructure in
    eScience
  • ChemBioGrid will provide portals (user
    interfaces) to distributed databases, results of
    high throughput screening instruments, results of
    computational chemical simulations and other
  • ChemBioGrid will provide services to manipulate
    this data and combine it in workflows; it will
    have convenient ways to submit and manage multiple
  • ChemBioGrid will include access to PubChem,
    PubMed, PubMed Central, the Internet and its
    derivatives like Microsoft Academic Live and
    Google Scholar
  • The services include open-source software like
    CDK, commercial code from vendors such as BCI,
    OpenEye, Gaussian and Google, and any
    user-contributed programs
  • ChemBioGrid will define open interfaces to use
    for a particular type of service, allowing plug
    and play choice between different implementations
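
The plug and play idea in the last bullet can be illustrated with a small sketch. This is only an illustration under stated assumptions: the interface name SimilarityService, the two implementations and the rank_hits helper are hypothetical and are not part of ChemBioGrid; fingerprints are assumed to be sets of on-bit indices. The point is that a caller written against one interface can swap implementations freely.

```python
import math
from typing import Dict, FrozenSet, List, Protocol, Tuple


class SimilarityService(Protocol):
    """Hypothetical open interface: every implementation offers this one method."""

    def similarity(self, fp_a: FrozenSet[int], fp_b: FrozenSet[int]) -> float:
        ...


class DiceService:
    """Dice coefficient: 2 * shared bits / (bits in a + bits in b)."""

    def similarity(self, fp_a, fp_b):
        total = len(fp_a) + len(fp_b)
        return 2 * len(fp_a & fp_b) / total if total else 0.0


class CosineService:
    """Cosine coefficient: shared bits / sqrt(bits in a * bits in b)."""

    def similarity(self, fp_a, fp_b):
        if not fp_a or not fp_b:
            return 0.0
        return len(fp_a & fp_b) / math.sqrt(len(fp_a) * len(fp_b))


def rank_hits(
    service: SimilarityService,
    query: FrozenSet[int],
    library: Dict[str, FrozenSet[int]],
) -> List[Tuple[str, float]]:
    """Score every library entry against the query, best first.

    The caller depends only on the interface, so either service can be passed in.
    """
    scored = [(name, service.similarity(query, fp)) for name, fp in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Calling rank_hits(DiceService(), query, library) and rank_hits(CosineService(), query, library) uses the same client code with two different back ends, which is the kind of choice a shared service interface is meant to allow.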

Chemical Informatics and Cyberinfrastructure
Collaboratory Funded by the National Institutes
of Health www.chembiogrid.org
CICC Combines Grid Computing with Chemical
Large Scale Computing Challenges
Science and Cyberinfrastructure
CICC is an NIH-funded project to support the
chemical informatics needs of High Throughput
Cancer Screening Centers. The NIH is creating a
data deluge of publicly available data on
potential new drugs.
Chemical Informatics is a non-traditional area of
high performance computing, but many new,
challenging problems may be investigated.
NIH PubMed DataBase
OSCAR Text Analysis
Toxicity Filtering
Cluster Grouping
Initial 3D Structure Calculation
OSCAR-mined molecular signatures can be
clustered, filtered for toxicity, and docked onto
larger proteins. These are classic pleasingly
parallel tasks. Top-ranking docked molecules
can be further examined for drug potential.
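
A minimal sketch of why such tasks are called pleasingly parallel: each molecule is processed independently, so the work is a plain map over the input list. Everything named here is an assumption for illustration. score_molecule is a hypothetical placeholder (it scores a SMILES string by its length, not by docking or toxicity), and a thread pool stands in for the cluster or TeraGrid job submission a real run would use.

```python
from concurrent.futures import ThreadPoolExecutor


def score_molecule(smiles):
    """Hypothetical stand-in for an expensive per-molecule job.

    A real job would cluster, filter or dock the molecule; here the
    "score" is just the string length so the sketch stays runnable.
    """
    return smiles, len(smiles)


def score_all(smiles_list, workers=4):
    """Run the independent jobs side by side; results keep input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(score_molecule, smiles_list))


def top_ranked(results, n=2):
    """Keep the n highest-scoring molecules for closer examination."""
    return sorted(results, key=lambda item: item[1], reverse=True)[:n]
```

Because no job needs data from any other job, adding workers needs no change to the per-molecule code; only the final ranking step looks at all results together.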
Chemical informatics text analysis programs can
process hundreds of thousands of abstracts of
online journal articles to extract chemical
signatures of potential drugs.
Molecular Mechanics Calculations
Big Red (and the TeraGrid) will also enable us to
perform time-consuming, multi-step Quantum
Chemistry calculations on all of PubMed. Results
go back to public databases that are freely
accessible by the scientific community.
  • CICC supports the NIH mission by combining state
    of the art chemical informatics techniques with
  • World class high performance computing
  • National-scale computing resources (TeraGrid)
  • Internet-standard web services
  • International activities for service
  • Open distributed computing infrastructure for
    scientists world wide

NIH PubChem DataBase
Quantum Mechanics Calculations
IU's Varuna DataBase
POVRay Parallel Rendering
Indiana University Department of Chemistry,
School of Informatics, and Pervasive Technology

CICC Prototype Web Services
Basic cheminformatics
  • Molecular weights
  • Molecular formulae
  • Tanimoto similarity
  • 2D Structure diagrams
  • Molecular descriptors
  • 3D structures
  • InChI generation/search
  • CMLRSS
  • R and Excel
Key Ideas
  • Add value to PubChem with additional distributed
    services and databases
  • Develop nifty ideas like VOTables
  • Wrapping existing code in web services is not
  • Provide core (CDK) services and exemplars of
    typical tools
  • Provide access to key databases via a web
    service interface
  • Provide access to major Compute Grids

Next steps?
  • Define WSDL interfaces to enable global
    production of compatible Web services;
    refine CML
  • Add more services (identify gaps)
  • Add more databases, including 3D structural info
  • Demonstrate use of services in other pipelining
    tools (KDE, Knime; Pipeline Pilot already done)
  • Extend Computational Chemistry (Varuna) Services
  • Routine TeraGrid and Big Red use
  • Production on OSCAR3, CDK, Gamess, Jaguar
  • Develop more training material

Application based services
  • Compare (NIH)
  • Toxicity predictions (ToxTree)
  • Literature extraction (OSCAR3)
  • Clustering (BCI Toolkit)
  • Docking, filtering, ... (OpenEye)
  • Varuna simulation

Web Service Locations
  • Cambridge University
  • InChI generation / search
  • OpenBabel
  • Indiana University
  • Clustering
  • VOTables
  • OSCAR3
  • Toxicity classification
  • Database services

SDSC Typical TeraGrid Site
  • InfoChem
  • SPRESI database

NIH PubChem .. Compare ..
  • Penn State University
  • (now moved to IU)
  • CDK based services
  • Fingerprints
  • Similarity calculations
  • 2D structure diagrams
  • Molecular descriptors

Cheminformatics Education at IU
  • Linked to bioinformatics in Indiana University's
    School of Informatics
  • School of Informatics degree programs: BS, MS, PhD
  • Programs offered at both the Indianapolis (IUPUI)
    and Bloomington (IUB) campuses
  • Bioinformatics MS and track on PhD
  • Chemical Informatics MS and track on PhD
  • Informatics Undergraduates can choose a chemistry
    cognate (change to Life Sciences)
  • PhD in Informatics started in August 2005 and
    offers tracks in
  • bioinformatics, chemical informatics, health
    informatics, human-computer interaction design,
    social and organizational informatics, more to
  • Good employer interest but modest student
    understanding of value of Cheminformatics degree
  • 3 core courses in Cheminformatics plus
    seminar/independent studies
  • Significant interest in distance education
    version of introductory Cheminformatics course
    (enrollment promising in Distance Graduate
    Certificate in Chemical Informatics)

Current Status
  • Web site http://www.chembiogrid.org
  • Wiki chosen to support project as a shared
    editable web space
  • Building Collaboratory involving PubChem: Global
    Information System accessible anywhere and at any
    time; enhance PubChem with distributed tools
    (clustering, simulation, annotation etc.) and
  • Adopted Taverna as workflow system, as it is
    popular in Bioinformatics, but we will evaluate
    other systems such as GPEL from LEAD
  • Demonstrated CI-enhanced Chemistry simulations
  • Initiated Data-mining, User interface and
    Chemical Informatics tools research
  • Prototyped large set of runs on local Big Red 23
    Teraflop supercomputer (OSCAR3 and modeling,
    moving to CDK, Gamess, Jaguar)
  • Initial results discussed at conferences/workshops
  • Gordon Conferences, ACS, SDSC tutorial
  • First new Cheminformatics courses offered
  • Advisory board set up and met; this is second
  • Videoconferencing-based meetings with Peter
    Murray-Rust and group at Cambridge roughly every
    2-3 weeks
  • Good or potentially good interactions with Local
    HTS in CGB, NIH DTP, Scripps, Lilly and Michigan

MLSCN Post-HTS Biology Decision Support
Percent Inhibition or IC50 data is retrieved from
Grids can link data analysis (e.g. image
processing developed in existing Grids),
traditional Chem-informatics tools, as well as
annotation tools (Semantic Web, del.icio.us) and
enhance lead ID and SAR analysis. A Grid of Grids
linking collections of services at PubChem, ECCR
centers, MLSCN centers
Question: Was this screen successful?
Workflows encoding plate control well
statistics, distribution analysis, etc.
Question: What should the active/inactive cutoffs
be?
Workflows encoding distribution analysis of
screening results
Question: What can we learn about the target
protein or cell line from this screen?
Workflows encoding statistical comparison of
results to similar screens, docking of compounds
into proteins to correlate binding with
activity, literature search of active compounds,
Compounds submitted to PubChem
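
The first two questions can be made concrete with a short sketch of control well statistics. The slide does not name specific statistics, so both choices below are assumptions: the Z'-factor is a widely used screen quality measure computed from positive and negative control wells, and a cutoff of three standard deviations above the sample mean is one common convention for calling actives. Function names and sample values are illustrative only.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|.

    Values near 1 indicate well separated controls; values near or
    below 0 indicate that the control distributions overlap.
    """
    separation = abs(mean(positive_controls) - mean(negative_controls))
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    spread = 3 * (stdev(positive_controls) + stdev(negative_controls))
    return 1 - spread / separation


def three_sigma_cutoff(sample_values):
    """Assumed convention: call a well active above mean + 3 * sd."""
    return mean(sample_values) + 3 * stdev(sample_values)


def call_actives(sample_values, cutoff):
    """Return the indices of wells whose signal exceeds the cutoff."""
    return [i for i, value in enumerate(sample_values) if value > cutoff]
```

A workflow could run z_prime on each plate first and only compute cutoffs and active calls for plates that pass a quality threshold chosen by the screening center.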
Example HTS workflow finding cell-protein
A protein implicated in tumor growth with known
ligand is selected (in this case HSP90 taken from
the PDB 1Y4 complex)
The screening data from a cellular HTS assay is
similarity searched for compounds with similar 2D
structures to the ligand.
Similar structures to the ligand can be browsed
using client portlets.
Similar structures are filtered for drugability,
are converted to 3D, and are automatically passed
to the OpenEye FRED docking program for docking
into the target protein.
Once docking is complete, the user visualizes the
high-scoring docked structures in a portlet using
the JMOL applet.
Docking results and activity patterns fed into R
services for building of activity models and
LeastSquares Regression
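
The least squares regression step of this workflow, relating docking results to measured activity, can be sketched as follows. The slides run this step in R services; the Python below is only a stand-in to show the calculation, and the function names and the single-predictor form (one docking score per compound) are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new docking score."""
    slope, intercept = model
    return slope * x + intercept
```

Here xs would hold the docking scores of the screened compounds and ys their measured activities; the fitted line can then give a rough activity estimate for a compound that has been docked but not yet assayed.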
Varuna environment for molecular modeling (Baik,
Chemical Concepts
Papers etc.
Simulation Service: FORTRAN Code, Scripts
DB Service: Queries, Clustering, Curation, etc.
QM Database
PubChem, PDB, NCI, etc.
QM/MM Database

Methods Development at the CICC
  • Tagging methods for web-based annotation
    exploiting del.icio.us and Connotea
  • Development of QSAR model interpretability and
    applicability methods
  • RNN-Profiles for exploration of chemical spaces
  • VisualiSAR - SAR through visual analysis
  • See http://www.daylight.com/meetings/mug99/Wild/Mu
  • Visual Similarity Matrices for High Volume
  • See http://www.osl.iu.edu/chemuell/new/bioinforma
  • Fast, accurate clustering using parallel Divisive
  • Mapping of Natural Language queries to use cases
    and workflows
  • Advanced data mining models for drug discovery

Structure of Proposal
  • a) Define audience that we are targeting
  • b) Cyberinfrastructure Framework with Key
    services -- Registry, Computing, portal, workflow
  • Exemplar Chemoinformatics Services
  • Exemplar workflows using services
  • WSDL defined for key cases to allow
    others to contribute
  • Tutorial
  • c) Education
  • d) IT/Cyber-enhanced Computational Chemistry
  • e) Cheminformatics Research
  • Systems
  • Tools and Modeling

  • We expect to respond to big NIH RFP in about 4
  • Should we partner with Michigan?
  • Who is customer and how do we get more?
  • Do/Should chemists want our or, more generally,
    NIH's product?
  • Interactions with large and small industry
  • What is balance between infrastructure,
    computational chemistry, Cheminformatics tools
    and research, chemical informatics systems and
  • Should we stress literature (OSCAR3) project?
  • Balance of applications and generic capabilities?
  • How should we structure education component?
  • Field does not have strong student appeal
    compared to Bioinformatics
  • We are strong in Computer Sciences
    (Grids/Cyberinfrastructure) but doubtful if any
    CS reviewers
  • We are strong in Cheminformatics systems but not
    clear a recognized activity and how do we justify
    claim that Grids/Cyberinfrastructure/Open Access
  • Should we link more with biology?

Covering our bases: Who are our Customers?
What do we need to conquer traditional chemical
Research Community
- High-Fidelity Structural Data, Redox
Potentials, Spectroscopy, Transition State
Structures, Energies, Molecular Orbitals..
Departments of the future Center