Building a Chemical Informatics Grid - PowerPoint PPT Presentation

Slides: 47
Provided by: Marlon74


Transcript and Presenter's Notes

Title: Building a Chemical Informatics Grid

Building a Chemical Informatics Grid
  • Marlon Pierce
  • Community Grids Laboratory
  • Indiana University

Chemical Informatics as a Grid Application
  • Chemical Informatics is the application of
    information technology to problems in chemistry.
  • Example problems: managing data in large-scale
    drug discovery and molecular modeling
  • Building blocks: Chemical Informatics resources
  • Chemical databases maintained by various groups
  • NIH PubChem, NIH DTP
  • Application codes (both commercial and open
    source)
  • Data mining, clustering
  • Quantum chemistry and molecular modeling
  • Visualization tools
  • Web resources: journal articles, etc.
  • A Chemical Informatics Grid will need to
    integrate these into a common, loosely coupled,
    distributed computing environment.

Problem: Connecting It Together
  • The problem is defining an architecture for tying
    all of these pieces into a distributed computing
    environment.
  • A Grid
  • How can we combine application codes, web
    resources, and databases to solve a particular
    science problem?
  • Specifically, how do we build a runtime
    environment that can connect the distributed
    services we need to solve an interesting problem?
  • For academic and government researchers, how can
    we do all of this in an open fashion?
  • Data and services can come from anywhere
  • That is, we must avoid proprietary
  • Individual pieces may be commercial, however.

NIH Roadmap for Medical Research http://nihroadmap
  • The NIH recognizes chemical and biological
    information management as critical to medical
    research.
  • Federally funded high-throughput screening
  • 100-200 HTS assays per year on small molecules.
  • 100,000s of small molecules analyzed
  • Data published, publicly available through NIH
    PubChem online database.
  • What do you do with all of this data?
  • That is, how can you create an extensible toolbox
    of services that can be combined into interesting
    applications for clustering, mining, and modeling
    the data?

The Solution, Part I: Web Services
  • Web Services provide the means for wrapping
    databases, applications, web scavengers, etc.,
    with programming interfaces.
  • WSDL definitions define how to write clients to
    talk with databases, applications, etc.
  • Web Service messaging through SOAP
  • Discovery services such as UDDI, MDS, and so on.
  • Many toolkits available
  • Axis, .NET, gSOAP, SOAP::Lite, etc.
  • Web Services can be combined with each other into
  • Workflow = use case scenario
  • More about this later.
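
As a rough illustration of the SOAP messaging mentioned above, the sketch below wraps a single operation call in a SOAP 1.1 envelope using only the Python standard library. The service namespace `urn:example:cheminfo` and the parameter name `smiString` are made up for this example; a real client would take both from the service's WSDL, usually through a toolkit rather than by hand.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:cheminfo"  # hypothetical service namespace (assumption)


def build_soap_request(operation, params):
    """Wrap an operation call and its parameters in a SOAP 1.1 envelope."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes an unqualified child element of the call.
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# The operation name is borrowed from the BCI table later in the deck;
# the parameter name is an assumption.
request = build_soap_request("makebitsGenerate", {"smiString": "CCO"})
```

The resulting bytes would be sent in an HTTP POST to the service endpoint; the transport step is left out because no endpoint is given in the slides.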

Basic Architectures: Servlets/CGI and Web Services
[Diagram labels: GUI Client, Web Server]

Solution Part II: Grid Resources
  • Many Grid tools provide powerful backend services
  • Globus: uniform, secure access to computing
    resources (like TeraGrid)
  • File management, resource allocation management,
  • Condor: job scheduling on computer clusters and
  • SRB: data grid access
  • OGSA-DAI: uniform Grid interface to databases.
  • These have Web Service as well as other
    interfaces (or equivalently, protocols).

Solution, Part III: Domain-Specific Tools and
Standards --> More Services
  • For Chemical Informatics, we have a number of
    tools and standards.
  • Chemical string representations
  • Chemistry Markup Language
  • XML language for describing, exchanging data.
  • JUMBO 5: a CML parser and library
  • Glue Tools and Applications
  • Chemistry Development Kit (CDK)
  • OpenBabel
  • These are the basis for building interoperable
    Chemical Informatics Web Services
  • Analogous situations exist for other domains
  • Astronomy, Geosciences, Biology/Bioinformatics

Solution Part IV: Workflows
  • Workflow engines allow you to connect services
    together into interesting composite applications.
  • This allows you to directly encode your
    scientific use case scenario as a graph of
    interacting services.
  • There are many workflow tools
  • We'll briefly cover these later.
  • General guidance is to build web services first
    and then use workflow tools on top of these
  • Don't get married to a particular workflow
    technology yet, unless someone pays you.
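
To make the "graph of interacting services" idea concrete, here is a minimal sketch that runs a small dependency graph in topological order. Everything in it is a stand-in: the four "services" are plain local functions with toy logic, not real web service clients, and `run_workflow` is a hypothetical helper rather than part of any workflow engine named in this talk.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Stand-in "services": plain functions with toy logic, not real clients.
def fetch_smiles(inputs):
    return ["CCO", "CCN", "c1ccccc1"]


def toy_fingerprint(inputs):
    # Toy fingerprint: the set of distinct characters in each string.
    return {s: frozenset(s) for s in inputs["fetch_smiles"]}


def count_letters(inputs):
    return {s: sum(ch.isalpha() for ch in s) for s in inputs["fetch_smiles"]}


def report(inputs):
    fps, counts = inputs["toy_fingerprint"], inputs["count_letters"]
    return [(s, len(fps[s]), counts[s]) for s in sorted(fps)]


SERVICES = {
    "fetch_smiles": fetch_smiles,
    "toy_fingerprint": toy_fingerprint,
    "count_letters": count_letters,
    "report": report,
}

# Each node maps to the set of nodes whose output it consumes.
GRAPH = {
    "fetch_smiles": set(),
    "toy_fingerprint": {"fetch_smiles"},
    "count_letters": {"fetch_smiles"},
    "report": {"toy_fingerprint", "count_letters"},
}


def run_workflow(graph, services):
    """Call each service once all of its upstream results are available."""
    results = {}
    for node in TopologicalSorter(graph).static_order():
        inputs = {dep: results[dep] for dep in graph[node]}
        results[node] = services[node](inputs)
    return results


results = run_workflow(GRAPH, SERVICES)
```

Because each node only sees the outputs of its declared dependencies, swapping a local function for a remote call would not change the graph, which is one reason to build the services first and pick the workflow tool later.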

Solution Part V: User Interfaces
  • Web Services allow you to cleanly separate user
    interfaces from backend services.
  • Model-view-controller pattern for web
  • Client environments include
  • Grid and web service scripting environments
  • Desktop tools like Taverna and Kepler
  • Portlet-based Web portal systems
  • Typically, desktop tools like Taverna are used by
    power users to define interesting workflows.
  • Portals are for running canned workflows.

Wrapping Science Applications
Wrapping Science Applications as Services
  • Science Grid services typically must wrap legacy
    applications written in C or Fortran.
  • You must handle such problems as
  • Specifying several input and output files
  • These may need to be staged in
  • Launching executables and monitoring their
  • Specifying environment variables
  • Often these also have shell scripts to do some
    miscellaneous tasks.
  • How do you convert this to WSDL?
  • Or (equivalently) how do you automatically
    generate the XML job description for WS-GRAM?

Our Solution: Apache Ant Services
  • We've found using Apache Ant to be very useful
    for wrapping services.
  • Can call executables, set environment variables.
  • Lots of useful built-in shell-like tasks.
  • Extensible (write your own tasks).
  • Develop build scripts to run your application
  • You can easily call Ant from other Java programs.
  • So just write a wrapper service
  • We use both blocking (hold connection until
    return) and non-blocking versions (suitable for
    long running codes).
  • In the non-blocking case, a Context web service is
    used for callbacks.
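
The blocking and non-blocking styles can be sketched without Ant or a web service container. The example below is only an analogy in Python: `run_blocking` and `run_non_blocking` are hypothetical names, a background thread stands in for the long-running job, and a plain callback stands in for the Context web service. The wrapped "application" is just the Python interpreter printing a word.

```python
import subprocess
import sys
import threading


def run_blocking(command, env=None):
    """Run the executable and hold the caller until it returns."""
    done = subprocess.run(command, env=env, capture_output=True, text=True)
    return done.returncode, done.stdout


def run_non_blocking(command, on_done, env=None):
    """Return immediately; deliver the result later through a callback."""
    def worker():
        on_done(*run_blocking(command, env))

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


# Stand-in for a legacy executable: the interpreter printing one word.
command = [sys.executable, "-c", "print('clustered')"]

code, output = run_blocking(command)

results = []
handle = run_non_blocking(command, lambda rc, out: results.append((rc, out)))
handle.join()  # a real client would poll or be notified instead of joining
```

The `env` argument corresponds to the "set environment variables" bullet; passing `None` inherits the caller's environment.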

Flow Chart: SMILES to Cluster Partitions in the
BCI Web Service
[Flow chart labels: SMILES String; Dictionary (Default); Generating Fingerprints; Fingerprint (.scn); Clustering Fingerprints; Cluster Hierarchy (.dkm); Generating the best levels; Extracting individual cluster partitions; Extracted Cluster Hierarchy (.clu); One Column Process; Merge Process; New SMILES String]
BCI Clustering Service Methods
Service Method | Description | Input | Output
makebitsGenerate | Generate fingerprints from a SMILES structure | SMIstring | Fingerprint string
divkmGenerate | Cluster fingerprints with Divkmeans | SCNstring | Clustered Hierarchy
smile2dkm | Makebits + divkm | SMIstring | Clustered Hierarchy
optclusGenerate | Generate the best levels in a hierarchy | DKMstring | Best partition cluster level
rnnclusGenerate | Extract individual cluster partitions | DKMstring | Indiv. cluster partitions
smile2ClusterPartitioned | Generate a new SMILES structure w/ extra col. | SMIstring | New SMILES structure
A Library of Chemical Informatics Web Services
All Services Great and Small
  • Like most Grids, a Chemical Informatics Grid will
    have the classic styles
  • Data Grid Services: these provide access to data
    sources like PubChem, etc.
  • Execution Grid Services: used for running cluster
    analysis programs, molecular modeling codes, etc.,
    on TeraGrid and similar places.
  • But we also need many additional services
  • Handling format conversions (InChI <-> SMILES)
  • Shipping and manipulating tabular data
  • Determining toxicity of compounds
  • Generating batch 2D images
  • So one of our core activities is to build lots of
    services.

VOTables: Handling Tabular Data
  • Developed by the Virtual Observatory community
    for encoding astronomy data.
  • The VOTable format is an XML representation of
    the tabular data (data coming from BCI, NIH DTP
    databases, and so on).
  • VOTables-compatible tools have been built
  • We just inherit them.
  • SAVOT and JAVOT: Java parser APIs for VOTable
    allow us to easily build VOTable-based
  • Web Services
  • Spreadsheets
  • Plotting applications.
  • VOPlot and TopCat are two

Document Structure of VOTable
<?xml version="1.0"?>
<VOTABLE version="1.1" xmlns:xsi="http://" xsi:noNamespaceSchemaLocation="http://">
  <RESOURCE>
    <TABLE name="results">
      <FIELD name="CompoundName" ID="col1" datatype="char" arraysize="*"/>
      <FIELD name="ClusterNumber" ID="col2" datatype="int"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>Acemetacin</TD><TD>1</TD></TR>
          <TR><TD>Candesartan</TD><TD>1</TD></TR>
          <TR><TD>Acenocoumarol</TD><TD>2</TD></TR>
          <TR><TD>Dicumarol</TD><TD>2</TD></TR>
          <TR><TD>Phenprocoumon</TD><TD>2</TD></TR>
          <TR><TD>Trioxsalen</TD><TD>2</TD></TR>
          <TR><TD>warfarin</TD><TD>2</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
Compound Name | Cluster Number
Acemetacin | 1
Candesartan | 1
Acenocoumarol | 2
Dicumarol | 2
Phenprocoumon | 2
Trioxsalen | 2
Warfarin | 2
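
A VOTable like the one above can be read with nothing more than a standard XML parser. The sketch below uses Python's `xml.etree.ElementTree` rather than the SAVOT or JAVOT APIs named in the slides; `read_votable` is a hypothetical helper, the sample is trimmed to two rows, and the namespace attributes are omitted for brevity. It only handles the `char` and `int` datatypes that appear in this example.

```python
import xml.etree.ElementTree as ET

VOTABLE_XML = """<?xml version="1.0"?>
<VOTABLE version="1.1">
  <RESOURCE>
    <TABLE name="results">
      <FIELD name="CompoundName" ID="col1" datatype="char" arraysize="*"/>
      <FIELD name="ClusterNumber" ID="col2" datatype="int"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>Acemetacin</TD><TD>1</TD></TR>
          <TR><TD>Warfarin</TD><TD>2</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>"""


def read_votable(xml_text):
    """Return the first table as a list of {field name: value} rows."""
    table = ET.fromstring(xml_text).find("./RESOURCE/TABLE")
    fields = table.findall("FIELD")
    names = [f.get("name") for f in fields]
    # Only two datatypes are handled in this sketch: int and everything else.
    casts = [int if f.get("datatype") == "int" else str for f in fields]
    rows = []
    for tr in table.iterfind("./DATA/TABLEDATA/TR"):
        cells = [td.text for td in tr.findall("TD")]
        rows.append({n: cast(c) for n, cast, c in zip(names, casts, cells)})
    return rows


rows = read_votable(VOTABLE_XML)
```

The FIELD elements carry the column names and types, so the same function works for any table that follows this layout.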
mrtd1.txt: SMILES representation of chemical
compounds along with their properties
votable.xml: XML representation of the mrtd1.txt file
VOPlot application from the generated votable.xml
file: graph plotted on Mass (X axis) and PSA (Y axis)
Other Uses for VOTables
  • VOTables is a useful intermediate format for
    exchanging data between databases.
  • Simple example: exchange data between VARUNA
  • Each student in the Baik group maintains his/her
    own copy (sandbox purposes).
  • Often need to import/export individual data sets.
  • It is also good for storing intermediate results
    in workflows.
  • The value is not the format, but the fact that the
    XML can be manipulated programmatically.
  • Unions, subset, intersection operations
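
Once a table has been parsed into rows, the union, subset, and intersection operations are short. The sketch below assumes each table is a list of dictionaries keyed by a shared column (here `CompoundName`); the helper names and the two sample "sandbox" tables are invented for illustration.

```python
def key_by(rows, key):
    """Index rows by the value of one column."""
    return {row[key]: row for row in rows}


def union(a, b, key):
    """All rows from a, plus rows from b whose key is not already present."""
    merged = key_by(a, key)
    for k, row in key_by(b, key).items():
        merged.setdefault(k, row)
    return list(merged.values())


def intersection(a, b, key):
    """Rows of a whose key also appears in b."""
    keys_b = set(key_by(b, key))
    return [row for row in a if row[key] in keys_b]


def subset(rows, predicate):
    """Rows that satisfy an arbitrary condition."""
    return [row for row in rows if predicate(row)]


# Two made-up sandbox tables with an overlapping compound.
sandbox_a = [
    {"CompoundName": "Acemetacin", "ClusterNumber": 1},
    {"CompoundName": "Warfarin", "ClusterNumber": 2},
]
sandbox_b = [
    {"CompoundName": "Warfarin", "ClusterNumber": 2},
    {"CompoundName": "Dicumarol", "ClusterNumber": 2},
]

everything = union(sandbox_a, sandbox_b, "CompoundName")
shared = intersection(sandbox_a, sandbox_b, "CompoundName")
cluster_two = subset(everything, lambda row: row["ClusterNumber"] == 2)
```

When both tables hold the same key, `union` keeps the row from the first table; a real import/export tool would need an explicit conflict rule.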

More Services: WWMM Services
Service | Description | Input | Output
InChIGoogle | Search an InChI structure through Google | inchiBasic type | Search result in HTML format
InChIServer | Generate InChI | version format | An InChI structure
OpenBabelServer | Transform a chemical format to another using Open Babel | format inputData outputData options | Converted chemical structure string
CMLRSSServer | Generate CMLRSS feed from CML data | mol, title description link, source | Converted CMLRSS feed of CML data
CDK-Based Services
Service | Description
Common Substructure | Calculates the common substructure between two molecules.
CDKsim | Takes two SMILES and evaluates the Tanimoto coefficient (ratio of intersection to union of their fingerprints).
CDKdesc | Calculates a variety of molecular and atomic descriptors for QSAR modeling
CDKws | Fingerprint generation
CDKsdg | Creates a JPEG of the compound's 2D structure
CDKStruct3D | Generates 3D coordinates of a molecule from its SMILES
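
The Tanimoto coefficient described for CDKsim is easy to state in code once fingerprints are available. The sketch below treats a fingerprint as the set of its "on" bit positions and computes intersection over union. It does not generate fingerprints from SMILES (that is what the CDK does), the bit positions are arbitrary sample values, and returning 0.0 for two empty fingerprints is a convention chosen for this example.

```python
def tanimoto(fp_a, fp_b):
    """Ratio of shared 'on' bits to all 'on' bits across two fingerprints."""
    a, b = set(fp_a), set(fp_b)
    if not a and not b:
        return 0.0  # convention chosen here; the ratio is undefined for two empty sets
    return len(a & b) / len(a | b)


# Arbitrary sample bit positions, not real fingerprints.
fp_one = {1, 2, 3, 4}
fp_two = {3, 4, 5, 6}

similarity = tanimoto(fp_one, fp_two)  # 2 shared bits out of 6 distinct bits
```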
ToxTree Service
  • The Threshold of Toxicological Concern (TTC)
    establishes a level of exposure for all chemicals
    below which there would be no appreciable risk to
    human health.
  • ToxTree implements the Cramer Decision Tree
    approach to estimate TTC.
  • We have converted this into a service.
  • Uses SMILES as input.
  • Note the GUI must be separated from the library
    to be a service

Taverna Workflow for Toxic Hazard Estimation
OSCAR3 Service
  • OSCAR3 is a tool for shallow, chemistry-specific
    natural language parsing of chemical documents
    (e.g., journal articles).
  • It identifies (or attempts to identify)
  • Chemical names: singular nouns, plurals, verbs,
    etc., also formulae and acronyms.
  • Chemical data: spectra, melting/boiling point,
    yield, etc., in experimental sections.
  • Other entities: things like N(5)-C(3) and so on.
  • There is a larger effort, SciBorg, in this area
  • http//
  • This (like ToxTree) is potentially productively
    pleasingly parallelized.
  • It also has potentially very interesting Workflows

[Workflow diagram labels: Extract abstracts; PubMed Query Service; Clustering Tools; Other Cheminfo Services; Extract SMILES; 3D Structure Generator; Create initial 3D structures; MM Applications; Quantum Chemistry DB; Refined 3D structures; QM Chemistry Info]
Use Cases and Workflows
  • Putting data and clustering together in a
    distributed environment.

Workflow, Services, and Science
  • Web Services work best as simple stateless
  • No implicit input, output, or interdependency of
  • Services must be composed into interesting
  • This is called workflow.
  • A good workflow ...
  • Is composed of independent services
  • Completely specifies an interesting science

Some Open Source Grid Workflow Projects
  • UK e-Science Project's Taverna
  • Scufl.xml scripting, GUI interface, works with
    Web Services.
  • Kepler
  • Works with Web services and the Globus Toolkit.
  • Condor DAGMan
  • Works over the top of Condor's scheduler.
  • Extended by the GriPhyN Virtual Data System
  • Java CoG Kit's Karajan
  • XML workflow specification for scripting COG
  • Works with GT 2 and 4.
  • Community Grids Lab's HPSearch
  • JavaScript scripting, works with Web services.
  • Indiana Extreme Lab's Workflow Composer
  • Jython, BPEL (soon) scripting

Finding compound-protein relationships
  • A 2D structure is supplied for input into the
    similarity search (in this case, the extracted
    bound ligand from the PDB 1Y4 complex).
  • The workflow employs our local NIH DTP database
    service to search 200,000 compounds tested in
    human tumor cellular assays for similar
    structures to the ligand. Client portlets are
    used to browse these structures.
  • Similar structures are filtered for druggability,
    and are automatically passed to the OpenEye FRED
    docking program for docking into the target
  • A protein implicated in tumor growth is supplied
    to the docking program (in this case HSP90 taken
    from the PDB 1Y4 complex).
  • Once docking is complete, the user visualizes the
    high-scoring docked structures in a portlet using
    the JMOL applet.
  • Correlation of docking results and biological
    fingerprints across the human tumor cell lines
    can help identify potential mechanisms of action
    of DTP compounds.

HTS data organization and flagging
  • A tumor cell line is selected. The activity
    results for all the compounds in the DTP database
    in the given range are extracted from the
    PostgreSQL database.
  • OpenEye FILTER is used to calculate biological
    and chemical properties of the compounds that are
    related to their potential effectiveness as drugs.
  • The compounds are clustered on chemical structure
    similarity, to group similar compounds together.
  • The compounds along with property and cluster
    information are converted to VOTable format and
    displayed in VOPlot.

Use Case: Which of these hits should I follow up?
  • An HTS experiment has produced 10,000 possible
    hits out of a screening set of 2 million
    compounds. A chemist on the project wants to know
    which series of compounds are most promising for
    follow-up, based on
  • Series selection -> cluster analysis
  • Structure-activity relationships -> modal
  • Chemical and pharmacokinetic properties
    -> mitools, ChemAxon
  • Compound history -> gNova / PostgreSQL
  • Patentability -> BCI Markush handling software
  • Toxicity
  • Synthetic feasibility
  • requires visualization tools!

A Workflow Scenario: HTS Data Organization and Flagging
  • This workflow demonstrates how screening data can
    be flagged and organized for human analysis.
  • The compounds and data values for a particular
    screen are retrieved from the NIH DTP database
    and then are filtered to remove compounds with
    reactive groups, etc.
  • A tumor cell line is selected. The activity
    results for all the compounds in the DTP database
    in the given range are extracted from the
    PostgreSQL database
  • OpenEye FILTER is used to calculate biological
    and chemical properties of the compounds that are
    related to their potential effectiveness as drugs
  • ToxTree is used to flag the potential toxicities
    of compounds.
  • Divkmeans is used to add a column of cluster
  • Finally, the results are visualized using VOPlot
    and the 2D viewer applet.
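
The scenario above is essentially a table that gains one column per step. The sketch below shows only that shape. None of the named tools (OpenEye FILTER, ToxTree, Divkmeans, VOPlot) or the NIH DTP database is called: every step is a stand-in with an arbitrary placeholder rule that has no chemical meaning, and the rows are made-up sample data.

```python
def retrieve(screen):
    """Stand-in for the database query: returns fixed, made-up rows."""
    return [
        {"smiles": "CCO", "activity": 5.2},
        {"smiles": "CC(=O)Cl", "activity": 6.8},
        {"smiles": "c1ccccc1O", "activity": 4.1},
    ]


def drop_flagged(rows):
    # Placeholder filter rule with no chemical meaning: drop strings containing "Cl".
    return [r for r in rows if "Cl" not in r["smiles"]]


def add_column(name, func):
    """Build a step that appends one computed column to every row."""
    def step(rows):
        return [{**r, name: func(r)} for r in rows]
    return step


PIPELINE = [
    drop_flagged,                                             # stand-in for filtering
    add_column("property", lambda r: len(r["smiles"])),       # stand-in for property calculation
    add_column("flag", lambda r: r["activity"] > 5.0),        # stand-in for hazard flagging
    add_column("cluster", lambda r: 1 if r["smiles"][0].isupper() else 2),  # stand-in for clustering
]


def run(rows, steps):
    """Apply each step in order; every step takes and returns a list of rows."""
    for step in steps:
        rows = step(rows)
    return rows


table = run(retrieve("example-screen"), PIPELINE)
```

Because every step has the same rows-in, rows-out signature, the final table could be serialized to a VOTable for plotting without the steps knowing about each other.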

Example plots of our workflow output using VOPlot
and VOTables
[Workflow diagram. Caption: Cluster the compounds in the NIH DTP database by chemical structure, then choose representative compounds from the clusters and dock them into PDB protein files of interest.]
[Web Services in the diagram: Fingerprint Generator (BCI Makebits); Cluster Analysis (BCI Divkmeans); NIH Database Service (PostgreSQL CHORD); Table Management (VOTables); Plot Visualizer (VOPlot); Docking Selector (Script); 3D Visualizer (JMOL); 2D-3D (OpenEye OMEGA); Docking (OpenEye FRED); PDB Database Service]
[Data labels in the diagram: SMILES, ID, Cluster, Data; Cluster Membership; Docked Complex; MOL File; PDB Structure, Box]

Use Case: Are there any good ligands for my
  • A chemist is working on a project involving a
    particular protein target, and wants to know
  • Any newly published compounds which might fit the
    protein receptor site -> gNova / PostgreSQL,
    PubChem search, FRED Docking
  • Any published 3D structures of the protein or of
    protein-ligand complexes -> PDB search
  • Any interactions of compounds with other proteins
    -> gNova / PostgreSQL, PubChem search
  • Any information published on the protein target ->
    Journal text search

Use Case: Who else is working on these structures?
  • A chemist is working on a chemical series for a
    particular project and wants to know
  • If anyone publishes anything using the same or
    related compounds -> PubChem search
  • Any new compounds added to the corporate
    collection which are similar or related -> gNova
    CHORD / PostgreSQL
  • If any patents are submitted that might overlap
    the compounds he is working on -> BCI Markush
    handling software
  • Any pharmacological or toxicological results for
    those or related compounds -> gNova CHORD /
    PostgreSQL, MiToolkit
  • The results for any other projects for which
    those compounds were screened -> gNova CHORD /
    PostgreSQL, PubChem search

Workflow for smile2ClusterPartitionedURL
Workflow for Toxic Hazard in Verbose
Diagram of Workflow2
Web Services
Beanshell Scripting