Building a Chemical Informatics Grid - PowerPoint PPT Presentation


Slides: 124
Provided by: marl129


Transcript and Presenter's Notes

Title: Building a Chemical Informatics Grid

Building a Chemical Informatics Grid
  • Marlon Pierce
  • Community Grids Laboratory
  • Indiana University

  • CICC researchers and developers who contributed
    to this presentation
  • Prof. Geoffrey Fox, Prof. David Wild, Prof.
    Mookie Baik, Prof. Gary Wiggins, Dr. Jungkee Kim,
    Dr. Rajarshi Guha, Sima Patel, Smitha Ajay, Xiao
  • Thanks also to Prof. Peter Murray Rust and the
    WWMM group at Cambridge University
  • More info and

Chemical Informatics and the Grid
  • An overview of the basic problem and solution

Chemical Informatics as a Grid Application
  • Chemical Informatics is the application of
    information technology to problems in chemistry.
  • Example problems: managing data in large-scale
    drug discovery and molecular modeling
  • Building blocks: chemical informatics resources
  • Chemical databases maintained by various groups
  • NIH PubChem, NIH DTP
  • Application codes (both commercial and open
  • Data mining, clustering
  • Quantum chemistry and molecular modeling
  • Visualization tools
  • Web resources: journal articles, etc.
  • A Chemical Informatics Grid will need to
    integrate these into a common, loosely coupled,
    distributed computing environment.

Problem: Connecting It Together
  • The problem is defining an architecture for tying
    all of these pieces into a distributed computing
  • A Grid
  • How can I combine application codes, web
    resources, and databases to solve a particular
    problem that interests me?
  • Specifically, how do I build a runtime
    environment that can connect the distributed
    services I need to solve an interesting problem?
  • For academic and government researchers, how can
    I do all of this in an open fashion?
  • Data and services can come from anywhere
  • That is, I must avoid proprietary infrastructure.

NIH Roadmap for Medical Research http://nihroadmap
  • The NIH recognizes chemical and biological
    information management as critical to medical
  • Federally funded high throughput screening
  • 100-200 HTS assays per year on small molecules.
  • 100,000s of small molecules analyzed
  • Data published, publicly available through NIH
    PubChem online database.
  • What do you do with all of this data?

High-Throughput Screening
Testing perhaps millions of compounds in a
corporate collection to see if any show activity
against a certain disease protein
High-Throughput Screening
  • Traditionally, small numbers of compounds were
    tested for a particular project or therapeutic
  • About 10 years ago, technology developed that
    enabled large numbers of compounds to be assayed
  • High-throughput screening can now test 100,000
    compounds a day for activity against a protein
  • Maybe tens of thousands of these compounds will
    show some activity for the protein
  • The chemist needs to intelligently select the 2-3
    classes of compounds that show the most promise
    as drugs for follow-up

Informatics Implications
  • Need to be able to store chemical structure and
    biological data for millions of data points
  • Computational representation of 2D structure
  • Need to be able to organize thousands of active
    compounds into meaningful groups
  • Group similar structures together and relate to
  • Need to learn as much information as
    possible (data mining)
  • Apply statistical methods to the structures and
    related information
  • Need to use molecular modeling to gain direct
    chemical insight into reactions.

The Solution, Part I: Web Services
  • Web Services provide the means for wrapping
    databases, applications, web scavengers, etc,
    with programming interfaces.
  • WSDL definitions define how to write clients to
    talk with databases, applications, etc.
  • Web Service messaging through SOAP
  • Discovery services such as UDDI, MDS, and so on.
  • Many toolkits available
  • Axis, .NET, gSOAP, SOAPLite, etc.
  • Web Services can be combined with each other into
  • Workflow = use case scenario
  • More about this later.
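
To make the SOAP messaging bullet concrete, here is a minimal sketch of what a toolkit such as Axis does under the hood: it wraps an operation name and its parameters in a SOAP 1.1 envelope, and the server unwraps them. The service namespace `urn:example:chem-services` and the operation name `convert` are assumptions for illustration only, not part of any service described here.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:chem-services"  # hypothetical service namespace


def build_soap_request(operation, params):
    """Client side: wrap an operation call in a SOAP 1.1 envelope."""
    ET.register_namespace("soapenv", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    op = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(op, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")


def parse_soap_request(xml_text):
    """Server side: recover the operation name and its parameters."""
    envelope = ET.fromstring(xml_text)
    body = envelope.find(f"{{{SOAP_NS}}}Body")
    op = list(body)[0]
    operation = op.tag.split("}", 1)[1]  # drop the "{namespace}" prefix
    return operation, {child.tag: child.text for child in op}


if __name__ == "__main__":
    message = build_soap_request("convert", {"smiles": "CCO", "format": "inchi"})
    print(message)
    print(parse_soap_request(message))
```

In practice the WSDL tells a toolkit how to generate this plumbing automatically, so client code rarely builds envelopes by hand.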

Basic Architectures: Servlets/CGI and Web Services
(Diagram labels: GUI Client, Web Server)
Solution, Part II: Grid Resources
  • Many Grid tools provide powerful backend services
  • Globus: uniform, secure access to computing
    resources (like TeraGrid)
  • File management, resource allocation management,
  • Condor: job scheduling on computer clusters and
  • SRB: data grid access
  • OGSA-DAI: uniform Grid interface to databases.
  • These have Web Service as well as other
    interfaces (or equivalently, protocols).

Solution, Part III: Domain-Specific Tools and
Standards --> More Services
  • For Chemical Informatics, we have a number of
    tools and standards.
  • Chemical string representations
  • Chemistry Markup Language
  • XML language for describing, exchanging data.
  • JUMBO 5: a CML parser and library
  • Glue Tools and Applications
  • Chemistry Development Kit (CDK)
  • OpenBabel
  • These are the basis for building interoperable
    Chemical Informatics Web Services
  • Analogous situations exist for other domains
  • Astronomy, Geosciences, Biology/Bioinformatics

Solution, Part IV: Workflows
  • Workflow engines allow you to connect services
    together into interesting composite applications.
  • This allows you to directly encode your
    scientific use case scenario as a graph of
    interacting services.
  • There are many workflow tools
  • We'll briefly cover these later.
  • General guidance is to build web services first
    and then use workflow tools on top of these
  • Don't get married to a particular workflow
    technology yet, unless someone pays you.

Solution, Part V: User Interfaces
  • Web Services allow you to cleanly separate user
    interfaces from backend services.
  • Model-view-controller pattern for web
  • Client environments include
  • Grid and web service scripting environments
  • Desktop tools like Taverna and Kepler
  • Portlet-based Web portal systems
  • Typically, desktop tools like Taverna are used by
    power users to define interesting workflows.
  • Portals are for running canned workflows.

Next steps
  • Next we will review the online database
    resources that are available to us.
  • Databases come in two varieties
  • Journal databases
  • Data databases
  • As we will discuss, it is useful to build
    services and workflows for automatically
    interacting with both types.

Online Chemical Journal and Data Resources
MEDLINE Online Journal Database
  • MEDLINE (Medical Literature Analysis and
    Retrieval System Online) is an international
    literature database of life sciences and
    biomedical information.
  • It covers the fields of medicine, nursing,
    dentistry, veterinary medicine, and health care.
  • MEDLINE covers much of the literature in biology
    and biochemistry, and fields with no direct
    medical connection, such as molecular evolution.
  • It is accessed via PubMed.

PubMed Journal Search Engine
  • PubMed is a free search engine offered by the
    United States National Library of Medicine as
    part of the Entrez information retrieval system.
  • The PubMed service allows searching the MEDLINE
  • MEDLINE covers over 4,800 journals published in
    the United States and more than 70 other
    countries primarily from 1966 to the present.
  • In addition to MEDLINE, PubMed also offers access
  • OLDMEDLINE for pre-1966 citations.
  • Citations to articles that are out-of-scope
    (e.g., general science and chemistry) from
    certain MEDLINE journals
  • In-process citations which provide a record for
    an article before it is indexed with MeSH and
    added to MEDLINE
  • Citations that precede the date that a journal
    was selected for MEDLINE indexing
  • Some life science journals

PubChem Chemical Database
  • PubChem is a database of chemical molecules.
  • The system is maintained by the National Center
    for Biotechnology Information (NCBI) which
    belongs to the United States National Institutes
    of Health (NIH).
  • PubChem can be accessed for free through a web
    user interface.
  • And Web Services for programmatic access
  • PubChem contains mostly small molecules with a
    molecular mass below 500.
  • Anyone can contribute
  • The database is free to use, but it is not
    curated, so the value of the information on a
    specific compound may be questionable.
  • NIH-funded HTS results are (intended to be)
    available through PubChem.

NIH DTP Database
  • Part of NIH's Developmental Therapeutics Program.
  • Screens up to 3,000 compounds per year for
    potential anticancer activity.
  • Utilizes 59 different human tumor cell lines,
    representing leukemia, melanoma and cancers of
    the lung, colon, brain, ovary, breast, prostate,
    and kidney.
  • DTP screening results are part of PubChem and
    also available as a separate database.

Example screening results. Positive results (red
bar to right of vertical line) indicate greater
than average toxicity of cell line to tested
  • COMPARE is an algorithm for mining DTP result
    data to find and rank order compounds with
    similar DTP screening results.
  • Why COMPARE?
  • Discovered compounds may be less toxic to humans
    but just as effective against cancer cell lines.
  • May be much easier/safer to manufacture.
  • May be a guide to deeper understanding of

Many Other Online Databases
  • Complementary protein information
  • Indiana University Varuna project
  • Discussed in this presentation
  • University of Michigan Binding MOAD
  • Mother of All Databases
  • Largest curated database of protein-ligand
  • Subset of protein databank
  • Prof. Heather Carlson
  • University of Michigan PDBBind
  • Provides a collection of experimentally measured
    binding affinity data (Kd, Ki, and IC50)
    exclusively for the protein-ligand complexes
    available in the Protein Data Bank (PDB)
  • Dr. Shaomeng Wang

The Point Is
  • All of these databases can be accessed on line
    with human-usable interfaces.
  • But that's not so important for our purposes
  • More importantly, many of them are beginning to
    define Web Service interfaces that let other
    programs interact with them.
  • Plenty of tools and libraries can simulate
    browsers, so you can also build your own service.
  • This allows us to remotely analyze databases with
    clustering and other applications without
    modifying the databases themselves.
  • Can be combined with text mining tools and web
    robots to find out who else is working in the

Encoding chemistry
Chemical Machine Languages
  • Interestingly, chemistry has defined three simple
    languages for encoding chemical information.
  • Can generate these by hand or automatically
  • InChIs and SMILES can represent molecules as a
    single string/character array.
  • Useful as keys for databases and for search
    queries in Google.
  • You can convert between SMILES and InChIs
  • OpenBabel, OELib, JOELib
  • CML is an XML format, and more verbose, but
    benefits from XML community tools

SMILES: Simplified Molecular Input Line Entry System
  • Language for describing the structure of chemical
    molecules using ASCII strings.

InChI: International Chemical Identifier
  • IUPAC and NIST standard similar to SMILES
  • Encodes structural information about compounds
  • Based on an open standard and algorithms.

InChI in Public Chemistry Databases
  • US National Institute of Standards and Technology
    (NIST) - 150,000 structures
  • NIH/NCBI/PubChem project - >3.2 million
  • Thomson ISI - 2 million structures
  • US National Cancer Institute (NCI) Database - 23
    million structures
  • US Environmental Protection Agency (EPA) DSSTox
    Database - 1450 structures
  • Kyoto Encyclopaedia of Genes and Genomes (KEGG)
    database - 9584 structures
  • University of California at San Francisco ZINC -
    >3.3 million structures
  • BRENDA enzyme information system (University of
    Cologne) - 36,000 structures
  • Chemical Entities of Biological Interest (ChEBI)
    database of the European Bioinformatics Institute
    - 5000 structures
  • University of California Carcinogenic Potency
    Project - 1447 structures
  • Compendium of Pesticide Common Names - 1437
    structures (2005-03-03)

Journals and Software Using InChI
  • Journals
  • Nature Chemical Biology.
  • Beilstein Journal of Organic Chemistry
  • Software
  • ACD/Labs ACD/ChemSketch.
  • ChemAxon Marvin.
  • SciTegic Pipeline Pilot.
  • CACTVS Chemoinformatics Toolkit by Xemistry,

Chemistry Markup Language
  • CML is an XML markup language for encoding
    chemical information.
  • Developed by Peter Murray Rust, Henry Rzepa and
  • Actually dates from the SGML days before XML
  • More verbose than InChI and SMILES
  • But inherits XML schema, namespaces, parsers,
    XPATH, language binding tools like XML Beans,
  • Not limited to structural information
  • Has OpenBabel support.

http//, http//cml.sourceforg
InChI Compared to SMILES
  • SMILES is proprietary and different algorithms
    can give different results.
  • Seven different unique SMILES for caffeine on Web
  • c1(n(CH3)c(c2(c(n1CH3)ncHn
  • CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12
  • Cn1cnc2n(C)c(=O)n(C)c(=O)c12
  • Cn1cnc2c1c(=O)n(C)c(=O)n2C
  • N1(C)C(=O)N(C)C2=C(C1=O)N(C)C=N2
  • O=C1C2=C(N=CN2C)N(C(=O)N1C)C
  • CN1C=NC2=C1C(=O)N(C)C(=O)N2C

On the other hand, some claim SMILES are more
intuitive for human readers.
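
As a small illustration of why non-unique SMILES strings are awkward as database keys, the sketch below takes two different caffeine SMILES strings and shows that they differ as text but describe the same heavy-atom composition. The atom counter is an assumption-laden toy: it only handles strings built from C, N and O with no bracket atoms or two-letter elements, and matching composition is a necessary condition for identity, not a proof. A real comparison would canonicalize both strings with a toolkit such as OpenBabel or CDK.

```python
from collections import Counter


def heavy_atom_counts(smiles):
    """Count C, N and O atoms in a SMILES string.

    Toy parser: assumes the string uses only C, N and O (upper case for
    aliphatic, lower case for aromatic) and no bracket atoms.
    """
    counts = Counter()
    for ch in smiles:
        if ch.upper() in ("C", "N", "O"):
            counts[ch.upper()] += 1
    return dict(counts)


# Two valid SMILES strings for caffeine, one aromatic and one Kekule form.
CAFFEINE_VARIANTS = [
    "Cn1cnc2n(C)c(=O)n(C)c(=O)c12",
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C",
]

if __name__ == "__main__":
    for smiles in CAFFEINE_VARIANTS:
        print(smiles, heavy_atom_counts(smiles))
```

A plain string comparison would treat these as two different compounds, which is the practical argument for a canonical identifier.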
A CML Example
Clustering Techniques, Computing Requirements,
and Clustering Services
  • Computational techniques for organizing data

The Story So Far
  • We've discussed managing screening assay output
    as the key problem we face
  • Must sift through mountains of data in PubChem
    and DTP to find interesting compounds.
  • NIH funded High Throughput Screening will make
    this very important in the near future.
  • Need now a way to organize and analyze the data.

Clustering and Data Analysis
  • Clustering is a technique that can be applied to
    large data sets to find similarities
  • Popular technique in chemical informatics
  • Data sets are segmented into groups (clusters) in
    which members of the same cluster are similar to
    each other.
  • Clustering is distinct from classification,
  • There are no pre-determined characteristics used
    to define the membership of a cluster,
  • Although items in the same cluster are likely to
    have many characteristics in common.
  • Clustering can be applied to chemical structures,
    for example, in the screening of combinatorial or
    Markush compound libraries in the quest for new
    active pharmaceuticals.
  • We also note that these techniques are fairly
  • More interesting clustering techniques exist but
    apparently are not well known by the chemical
    informatics community.

Non-Hierarchical Clustering
  • Clusters form around centroids.
  • The number of which can be specified by the user.
  • All clusters rank equally and there is no
    particular relationship between them.

Hierarchical Clustering
  • Clusters are arranged in hierarchies
  • Smaller clusters are contained within larger
    ones the bottom of the hierarchy consists of
    individual objects in "singleton" clusters, while
    the top of it consists of one cluster containing
    all the objects in the dataset.
  • Such hierarchies can be built either from the
    bottom up (agglomerative) or the top downwards

Fingerprinting and Dictionaries--What Is Your
Parameter Space?
  • Clustering algorithms require a parameter space
  • Clusters defined along coordinate axes.
  • Coordinate axes defined by a dictionary of
    chemical structures.
  • Use binary on/off for fingerprinting a particular
    compound against a dictionary.
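
The idea of a dictionary-based binary fingerprint can be sketched in a few lines. This is only an illustration under a strong simplifying assumption: each dictionary entry is matched as a plain substring of the SMILES text, which is a crude stand-in for real substructure matching as done by fingerprinting tools. The dictionary entries and example molecules are chosen for the demo and are not taken from any actual fingerprint dictionary.

```python
def fingerprint(smiles, dictionary):
    """One on/off bit per dictionary fragment (substring match as a stand-in)."""
    return [1 if fragment in smiles else 0 for fragment in dictionary]


def tanimoto(fp_a, fp_b):
    """Ratio of bits set in both fingerprints to bits set in either."""
    both = sum(1 for a, b in zip(fp_a, fp_b) if a and b)
    either = sum(1 for a, b in zip(fp_a, fp_b) if a or b)
    return both / either if either else 0.0


# Hypothetical four-fragment dictionary: carboxylic acid, benzene ring, N, Cl.
DICTIONARY = ["C(=O)O", "c1ccccc1", "N", "Cl"]

if __name__ == "__main__":
    benzoic_acid = fingerprint("c1ccccc1C(=O)O", DICTIONARY)
    aniline = fingerprint("Nc1ccccc1", DICTIONARY)
    print(benzoic_acid, aniline, tanimoto(benzoic_acid, aniline))
```

Each dictionary entry becomes one coordinate axis, so every compound is a point in a binary space where a clustering algorithm can measure similarity.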

Cluster Analysis and Chemical Informatics
  • Used for organizing datasets into chemical
    series, to build predictive models, or to select
    representative compounds
  • Clustering Methods
  • Jarvis-Patrick and variants
  • O(N^2), single partition
  • Ward's method
  • Hierarchical, regarded as best, but at least
  • K-means
  • < O(N^2), requires a set number of clusters, a little
  • Sphere-exclusion (Butina)
  • Fast, simple, similar to JP
  • Kohonen network
  • Clusters arranged in 2D grid, ideal for

Limitations of Ward's method for large datasets
  • Best algorithms have O(N^2) time requirement (RNN)
  • Requires random access to fingerprints
  • hence substantial memory requirements (O(N))
  • Problem of selection of best partition
  • can select desired number of clusters
  • Easily hit 4GB memory addressing limit on 32 bit
  • Approximately 2m compounds

Scaling up clustering methods
  • Parallelization
  • Clustering algorithms can be adapted for multiple
  • Some algorithms more appropriate than others for
    particular architectures
  • Ward's has been parallelized for shared memory
    machines, but overhead considerable
  • New methods and algorithms
  • Divisive (bisecting) K-means method
  • Hierarchical Divisive
  • Approx. O(NlogN)

Divisive K-means Clustering
  • New hierarchical divisive method
  • Hierarchy built from top down, instead of bottom
  • Divide complete dataset into two clusters
  • Continue dividing until all items are singletons
  • Each binary division done using K-means method
  • Originally proposed for document clustering
  • Bisecting K-means
  • Steinbach, Karypis and Kumar (Univ.
  • Found to be more effective than agglomerative
  • Forms more uniformly-sized clusters at given
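
The top-down procedure described on this slide can be sketched as follows. This is a minimal illustration, not the BCI Divkmeans implementation: it works on small numeric vectors with squared Euclidean distance, seeds each split deterministically (first point plus the point farthest from it), always splits the largest cluster, and stops at a requested number of clusters rather than continuing down to singletons. All of those choices are assumptions made to keep the example short.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def two_means(points, iterations=20):
    """Split points into two groups with plain K-means (K = 2)."""
    # Deterministic seeds: the first point and the point farthest from it.
    seeds = [points[0], max(points, key=lambda p: sq_dist(p, points[0]))]
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            nearer = 0 if sq_dist(p, seeds[0]) <= sq_dist(p, seeds[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            break  # all points coincide; no meaningful split
        new_seeds = [centroid(groups[0]), centroid(groups[1])]
        if new_seeds == seeds:
            break  # assignments have stabilized
        seeds = new_seeds
    return groups


def bisecting_kmeans(points, n_clusters):
    """Repeatedly bisect the largest cluster until n_clusters exist."""
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        clusters.sort(key=len, reverse=True)  # size-based selection
        largest = clusters[0]
        if len(largest) < 2:
            break
        left, right = two_means(largest)
        if not left or not right:
            break
        clusters = clusters[1:] + [left, right]
    return clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (20, 0), (21, 0)]
    for cluster in bisecting_kmeans(data, 3):
        print(cluster)
```

Recording each split instead of discarding it would give the full hierarchy, and swapping the size criterion for variance or diameter changes which partitions are read off that hierarchy.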

BCI Divkmeans
  • Several options for detailed operation
  • Selection of next cluster for division
  • size, variance, diameter
  • affects selection of partitions from hierarchy,
    not shape of hierarchy
  • Options within each K-means division step
  • distance measure
  • choice of seeds
  • batch-mode or continuous update of centroids
  • termination criterion
  • Have developed parallel version for Linux
    clusters / grids in conjunction with BCI
  • For more information, see Barnard and Engels
    talks at http//

Comparative execution times: NCI subsets, 2.2 GHz
Intel Celeron processor
(Chart values: 7h 27m, 3h 06m, 2h 25m)
Divisive K-means Conclusions
  • Much faster than Ward's, speed comparable to
    K-means, suitable for very large datasets
  • Time requirements approximately O(N log N)
  • Current implementation can cluster 1m compounds
    in under a week on a low-power desktop PC
  • Cluster 1m compounds in a few hours with a 4-node
    parallel Linux cluster
  • Better balance of cluster sizes than Ward's or
  • Visual inspection of clusters suggests better
    assembly of compound series than other methods
  • Better clustering of actives together than
    previously-studied methods
  • Memory requirements minimal
  • Experiments using AVIDD cluster and Teragrid
    forthcoming (50 nodes)
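
To see why the difference between O(N^2) and O(N log N) matters at this scale, the following sketch compares how much more work each growth rate implies when the dataset grows tenfold. It models only the asymptotic term; constant factors, memory behavior and the actual timings reported above are not captured, and the dataset sizes are chosen purely for illustration.

```python
import math


def quadratic(n):
    """Work proportional to N^2."""
    return n * n


def n_log_n(n):
    """Work proportional to N log N."""
    return n * math.log2(n)


def relative_cost(n, base_n, cost):
    """How many times more work n items need than base_n items."""
    return cost(n) / cost(base_n)


if __name__ == "__main__":
    small, large = 100_000, 1_000_000
    print("O(N^2):     x", relative_cost(large, small, quadratic))
    print("O(N log N): x", relative_cost(large, small, n_log_n))
```

Going from 100,000 to 1,000,000 compounds multiplies the quadratic term by 100 but the N log N term by only about 12.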

  • Effective exploitation of large volumes and
    diverse sources of chemical information is a
    critical problem to solve, with a potentially huge
    impact on the drug discovery process
  • Most information needs of chemists and drug
    discovery scientists are conceptually
    straightforward, but complex to implement
  • All of the technology is now in place to
    implement many of these information-need
    use cases; the four-level model using
    service-oriented architectures together with
    smart clients looks like a neat way of doing this
  • In conjunction with grid computing, rapid and
    effective organization and visualization of large
    chemical datasets is feasible in a web service
  • Some pieces are missing
  • Chemical structure search of journals (wait for
  • Automated patent searching
  • Effective dataset organization
  • Effective interfaces, especially visualization of
    large numbers of 2D structures

Divisive K-Means as a Web Service
  • The previous exercise was intended to show that
    Divisive K-Means is a classic example of Grid
  • Needs to be parallelized
  • Should run on TeraGrid
  • How do you make this into a service?
  • We'll go on a small tour before getting back to
    our problem.

Wrapping Science Applications as Services
  • Science Grid services typically must wrap legacy
    applications written in C or Fortran.
  • You must handle such problems as
  • Specifying several input and output files
  • These may need to be staged in
  • Launching executables and monitoring their
  • Specifying environment variables
  • Often these also have shell scripts to do some
    miscellaneous tasks.
  • How do you convert this to WSDL?
  • Or (equivalently) how do you automatically
    generate the XML job description for WS-GRAM?

Generic Service Toolkit (GFAC) (G. Kandaswamy, IU
and RENCI)
  • The Generic Service Toolkit can "wrap" any
    command-line application as an application
  • Given a set of input parameters, it runs the
    application, monitors the application and returns
    the results.
  • Requires no modification to program code.
  • Also has web user interface generating tools.
  • When a user accesses an application service, the
    user is presented with a graphical user interface
    (GUI) to that service.
  • The GUI contains a list of operations that the
    user is allowed to invoke on that service.
  • After choosing an operation, the user is
    presented with a GUI for that operation, which
    allows the user to specify all the input
    parameters to that operation.
  • The user can then invoke the operation on the
    service and get the output results.
OPAL (S. Krishan, SDSC)
  • Features include scheduling (using Globus and
    Condor/SGE) and security (using GSI-based
    certificates), and persistent state management.
  • The WSDL defines operations to do the following:
  • getAppMetadata: returns usage information and
    arbitrary application-specific metadata specified
    as an array of other elements,
  • e.g. a description of the various options that are
    passed to the application binary.
  • launchJob: runs a job with the specified input and
    returns a Job ID.
  • queryStatus: returns the status code, message, and
    URL of the working directory.
  • getOutputs: returns the outputs from a job that
    is identified by a Job ID:
  • URLs for the standard output and error
  • Array of structures representing the output file
    names and URLs
  • getOutputAsBase64ByName: returns
    the contents of an output file as Base64 binary.
  • destroy: destroys a running job
    identified by a Job ID.
  • launchJobBlocking: requires the
    list of arguments as a string, and an array of
    structures representing the input files.
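
The launch, poll and fetch pattern implied by these operations can be sketched with an in-memory stand-in. Only the operation names launchJob, queryStatus and getOutputs come from the list above; the `FakeJobService` class, the status strings "ACTIVE" and "DONE", and the shape of the returned values are assumptions for illustration and do not reflect the actual OPAL message formats.

```python
import time


class FakeJobService:
    """In-memory stand-in that mimics the operation names listed above."""

    def __init__(self, polls_until_done=2):
        self._jobs = {}
        self._polls_until_done = polls_until_done

    def launchJob(self, args, input_files):
        job_id = f"job-{len(self._jobs) + 1}"
        self._jobs[job_id] = {"args": args, "polls": 0}
        return job_id

    def queryStatus(self, job_id):
        job = self._jobs[job_id]
        job["polls"] += 1
        # Hypothetical status codes; a real service defines its own.
        return "DONE" if job["polls"] >= self._polls_until_done else "ACTIVE"

    def getOutputs(self, job_id):
        return {"stdout": f"ran with: {self._jobs[job_id]['args']}"}


def run_and_wait(service, args, input_files, poll_seconds=0.0, max_polls=100):
    """Launch a job, poll until it reports completion, then fetch outputs."""
    job_id = service.launchJob(args, input_files)
    for _ in range(max_polls):
        if service.queryStatus(job_id) == "DONE":
            return service.getOutputs(job_id)
        time.sleep(poll_seconds)
    raise TimeoutError(f"{job_id} did not finish after {max_polls} polls")


if __name__ == "__main__":
    print(run_and_wait(FakeJobService(), "-k 10", []))
```

Returning a Job ID immediately is what lets a long-running code outlive a single HTTP connection; the blocking variant trades that for simplicity.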

Our Solution: Apache Ant Services
  • We've found using Apache Ant to be very useful
    for wrapping services.
  • Can call executables, set environment variables.
  • Lots of useful built-in shell-like tasks.
  • Extensible (write your own tasks).
  • Develop build scripts to run your application
  • You can easily call Ant from other Java programs.
  • So just write a wrapper service
  • We use both blocking (hold connection until
    return) and non-blocking version (suitable for
    long running codes).
  • In non-blocking case, Context web service is
    used for callbacks.
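
The blocking and non-blocking styles can be illustrated without Ant itself. The sketch below is an assumption-based analogue in Python: it wraps an arbitrary command line, offers a blocking call that holds the caller until the program exits, and a non-blocking call that runs the program in a background thread and hands the output to a callback (standing in for the Context web service mentioned above). The `ECHO` command is a placeholder for a real legacy executable.

```python
import subprocess
import sys
import threading


def run_blocking(command):
    """Run the command and hold the caller until it returns its output."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def run_non_blocking(command, callback):
    """Start the command in the background; call `callback` with its output."""

    def worker():
        callback(run_blocking(command))

    thread = threading.Thread(target=worker)
    thread.start()
    return thread  # the caller may join() it or simply carry on


# Placeholder for a legacy executable: a tiny Python one-liner.
ECHO = [sys.executable, "-c", "print('clustered')"]

if __name__ == "__main__":
    print(run_blocking(ECHO).strip())
    handle = run_non_blocking(ECHO, lambda out: print("callback:", out.strip()))
    handle.join()
```

The wrapper never needs to know what the executable does, which is the same property that makes a build tool a convenient generic launcher.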

Flow Chart of SMILES to Cluster Partitioned of
BCI Web Service
(Flow chart labels: SMILES String; Generating Fingerprints; Dictionary (Default); Fingerprint (.scn); Clustering Fingerprints; Cluster Hierarchy (.dkm); Generating the best levels; Extracting individual cluster partitions; Extracted Cluster Hierarchy (.clu); One Column Process; Merge Process; New SMILES String)
BCI Clustering Service Methods
Service Method | Description | Input | Output
makebitsGenerate | Generate fingerprints from a SMILES structure | SMIstring | Fingerprint string
divkmGenerate | Cluster fingerprints with Divkmeans | SCNstring | Clustered Hierarchy
smile2dkm | Makebits + divkm | SMIstring | Clustered Hierarchy
optclusGenerate | Generate the best levels in a hierarchy | DKMstring | Best partition cluster level
rnnclusGenerate | Extract individual cluster partitions | DKMstring | Indiv. cluster partitions
smile2ClusterPartitioned | Generate a new SMILES structure w/ extra col. | SMIstring | New SMILES structure
A Library of Chemical Informatics Web Services
All Services Great and Small
  • Like most Grids, a Chemical Informatics Grid will
    have the classic styles
  • Data Grid Services: these provide access to data
    sources like PubChem, etc.
  • Execution Grid Services: used for running cluster
    analysis programs, molecular modeling codes, etc.,
    on TeraGrid and similar places.
  • But we also need many additional services
  • Handling format conversions (InChI <-> SMILES)
  • Shipping and manipulating tabular data
  • Determining toxicity of compounds
  • Generating batch 2D images
  • So one of our core activities is to build lots of

VOTables: Handling Tabular Data
  • Developed by the Virtual Observatory community
    for encoding astronomy data.
  • The VOTable format is an XML representation of
    the tabular data (data coming from BCI, NIH DTP
    databases, and so on).
  • VOTables-compatible tools have been built
  • We just inherit them.
  • SAVOT and JAVOT Java parser APIs for VOTable
    allow us to easily build VOTable-based
  • Web Services
  • Spread sheet
  • Plotting applications.
  • VOPlot and TopCat are two

Document Structure of VOTable
<?xml version="1.0"?>
<VOTABLE version="1.1" xmlns:xsi="http://" xsi:noNamespaceSchemaLocation="http://">
  <RESOURCE>
    <TABLE name="results">
      <FIELD name="CompoundName" ID="col1" datatype="char" arraysize="*"/>
      <FIELD name="ClusterNumber" ID="col2" datatype="int"/>
      <DATA>
        <TABLEDATA>
          <TR><TD>Acemetacin</TD><TD>1</TD></TR>
          <TR><TD>Candesartan</TD><TD>1</TD></TR>
          <TR><TD>Acenocoumarol</TD><TD>2</TD></TR>
          <TR><TD>Dicumarol</TD><TD>2</TD></TR>
          <TR><TD>Phenprocoumon</TD><TD>2</TD></TR>
          <TR><TD>Trioxsalen</TD><TD>2</TD></TR>
          <TR><TD>Warfarin</TD><TD>2</TD></TR>
        </TABLEDATA>
      </DATA>
    </TABLE>
  </RESOURCE>
</VOTABLE>
Compound Name | Cluster Number
Acemetacin | 1
Candesartan | 1
Acenocoumarol | 2
Dicumarol | 2
Phenprocoumon | 2
Trioxsalen | 2
Warfarin | 2
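
A table like the one above can be written to and read back from the VOTable layout with nothing more than a standard XML library, which is the practical benefit of choosing an XML format. The sketch below is a simplification: it emits only the VOTABLE, RESOURCE, TABLE, FIELD, DATA, TABLEDATA, TR and TD elements and omits namespace and schema declarations. The function names are illustrative and are not part of SAVOT or JAVOT.

```python
import xml.etree.ElementTree as ET


def make_votable(rows):
    """Serialize (compound name, cluster number) rows as a minimal VOTable."""
    votable = ET.Element("VOTABLE", version="1.1")
    resource = ET.SubElement(votable, "RESOURCE")
    table = ET.SubElement(resource, "TABLE", name="results")
    ET.SubElement(table, "FIELD", name="CompoundName", ID="col1",
                  datatype="char", arraysize="*")
    ET.SubElement(table, "FIELD", name="ClusterNumber", ID="col2",
                  datatype="int")
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for name, cluster in rows:
        tr = ET.SubElement(tabledata, "TR")
        ET.SubElement(tr, "TD").text = name
        ET.SubElement(tr, "TD").text = str(cluster)
    return ET.tostring(votable, encoding="unicode")


def read_votable(xml_text):
    """Parse the minimal VOTable back into (name, cluster) tuples."""
    root = ET.fromstring(xml_text)
    rows = []
    for tr in root.iter("TR"):
        name, cluster = [td.text for td in tr.findall("TD")]
        rows.append((name, int(cluster)))
    return rows


if __name__ == "__main__":
    xml_text = make_votable([("Acemetacin", 1), ("Warfarin", 2)])
    print(xml_text)
    print(read_votable(xml_text))
```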
(Figure labels:)
mrtd1.txt: SMILES representation of chemical compounds along with their properties
Taverna Client
Tomcat Server
WSDL: VOTableGeneratorService, retrieveVOTableDocument
votable.xml: XML representation of the mrtd1.txt file
VOPlot application from the generated votable.xml file: graph plotted on Mass (X axis) and PSA
Other Uses for VOTables
  • VOTables is a useful intermediate format for
    exchanging data between data bases.
  • Simple example: exchange data between VARUNA
  • Each student in the Baik group maintains his/her
    own copy (sandbox purposes).
  • Often need to import/export individual data sets.
  • It is also good for storing intermediate results
    in workflows.
  • Value is not the format, but the fact that the
    XML can be manipulated programmatically.
  • Unions, subset, intersection operations
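
Once tabular data has been parsed out of the XML, the union, subset and intersection operations are a few lines each. The sketch below assumes each table has already been reduced to a mapping from compound name to cluster number; that representation and the function names are illustrative choices, not part of any VOTable tool.

```python
def union(table_a, table_b):
    """All rows from both tables; on a shared key, table_b's value wins."""
    merged = dict(table_a)
    merged.update(table_b)
    return merged


def intersection(table_a, table_b):
    """Rows of table_a whose key also appears in table_b."""
    return {name: value for name, value in table_a.items() if name in table_b}


def subset(table, predicate):
    """Rows for which predicate(name, value) is true."""
    return {name: value for name, value in table.items() if predicate(name, value)}


if __name__ == "__main__":
    sandbox_one = {"Acemetacin": 1, "Candesartan": 1, "Warfarin": 2}
    sandbox_two = {"Warfarin": 2, "Dicumarol": 2}
    print(union(sandbox_one, sandbox_two))
    print(intersection(sandbox_one, sandbox_two))
    print(subset(sandbox_one, lambda name, cluster: cluster == 1))
```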

More Services: WWMM Services
Services | Descriptions | Input | Output
InChIGoogle | Search an InChI structure through Google | inchiBasic type | Search result in HTML format
InChIServer | Generate InChI | version format | An InChI structure
OpenBabelServer | Transform a chemical format to another using Open Babel | format inputData outputData options | Converted chemical structure string
CMLRSSServer | Generate CMLRSS feed from CML data | mol, title description link, source | Converted CMLRSS feed of CML data
CDK-Based Services
Service | Description
Common Substructure | Calculates the common substructure between two molecules.
CDKsim | Takes two SMILES and evaluates the Tanimoto coefficient (ratio of intersection to union of their fingerprints).
CDKdesc | Calculates a variety of molecular and atomic descriptors for QSAR modeling
CDKws | Fingerprint generation
CDKsdg | Creates a JPEG of the compound's 2D structure
CDKStruct3D | Generates 3D coordinates of a molecule from its SMILES
ToxTree Service
  • The Threshold of Toxicological Concern (TTC)
    establishes a level of exposure for all chemicals
    below which there would be no appreciable risk to
    human health.
  • ToxTree implements the Cramer Decision Tree
    approach to estimate TTC.
  • We have converted this into a service.
  • Uses SMILES as input.
  • Note the GUI must be separated from the library
    to be a service

Taverna Workflow for Toxic Hazard Estimation
OSCAR3 Service
  • Oscar3 is a tool for shallow, chemistry-specific
    natural language parsing of chemical documents
    (i.e. journal articles).
  • It identifies (or attempts to identify)
  • Chemical names: singular nouns, plurals, verbs,
    etc., also formulae and acronyms.
  • Chemical data: spectra, melting/boiling point,
    yield, etc. in experimental sections.
  • Other entities: things like N(5)-C(3) and so on.
  • There is a larger effort, SciBorg, in this area
  • http//
  • This (like ToxTree) could productively be run in a
    pleasingly parallel fashion.
  • It also has potentially very interesting Workflows

Use Cases and Workflows
  • Putting data and clustering together in a
    distributed environment.

Chemical Informatics as a Grid Problem
  • NIH-Funded experimental screening
  • NIH DTP and HTS projects are generating a wealth
    of raw data on small compounds.
  • Available in PubChem
  • Journal and chemical data sources often have
    public Web clients and GUIs.
  • But we need Web Service interfaces, not just Web
  • These provide programming interfaces for
    building both human and machine clients.
  • These need to be connected to computing resources
    for running clustering, data mining, and
    molecular modeling applications.
  • Excellent candidates for running on the TeraGrid
  • We can formulate scientific problems that map to
    inter-connections of Grid services.
  • This is generally called Grid workflow or
    Service Orchestration

These compounds look promising from their HTS
results. Should I commit some chemistry resources
to following them up?
Workflow, Services, and Science
  • Web Services work best as simple stateless
  • No implicit input, output, or interdependency of
  • Services must be composed into interesting
  • This is called workflow.
  • A good workflow ...
  • Is composed of independent services
  • Completely specifies an interesting science
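
The notion of a workflow as a graph of independent, stateless services can be sketched with a small dependency-ordered runner. Everything here is illustrative: each "service" is a plain function that receives only the results of the steps it depends on, and the three step functions are placeholders rather than real filtering or clustering codes. Engines such as Taverna or Kepler add remote invocation, monitoring and a GUI on top of the same basic idea.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(steps, dependencies, inputs):
    """Run each step after its upstream steps, passing results explicitly.

    steps: name -> function taking a dict of upstream results
    dependencies: name -> set of upstream names
    inputs: name -> value supplied by the user
    """
    results = dict(inputs)
    for name in TopologicalSorter(dependencies).static_order():
        if name in results:
            continue  # supplied as a workflow input
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = steps[name](upstream)
    return results


def remove_reactive(up):
    """Placeholder filter service: drop compounds flagged as reactive."""
    return [c for c in up["compounds"] if not c["reactive"]]


def toy_cluster(up):
    """Placeholder clustering service: group by name length parity."""
    return {c["name"]: len(c["name"]) % 2 for c in up["filter"]}


def make_table(up):
    """Placeholder table service: join compounds with cluster numbers."""
    return [(c["name"], up["cluster"][c["name"]]) for c in up["filter"]]


STEPS = {"filter": remove_reactive, "cluster": toy_cluster, "table": make_table}
DEPENDENCIES = {
    "filter": {"compounds"},
    "cluster": {"filter"},
    "table": {"filter", "cluster"},
}

if __name__ == "__main__":
    compounds = [
        {"name": "A", "reactive": False},
        {"name": "B", "reactive": True},
        {"name": "CC", "reactive": False},
    ]
    print(run_workflow(STEPS, DEPENDENCIES, {"compounds": compounds})["table"])
```

Because no step keeps hidden state or implicit inputs, any one of them can be replaced by a remote service call without touching the others.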

Some Open Source Grid Workflow Projects
  • UK e-Science Project's Taverna
  • Scufl.xml scripting, GUI interface, works with
    Web Services.
  • Kepler
  • Works with Web services and the Globus Toolkit.
  • Condor DAGMan
  • Works over the top of Condor's scheduler.
  • Extended by the GriPhyN Virtual Data System
  • Java CoG Kit's Karajan
  • XML workflow specification for scripting COG
  • Works with GT 2 and 4.
  • Community Grids Lab's HPSearch
  • JavaScript scripting, works with Web services.
  • Indiana Extreme Lab's Workflow Composer
  • Jython, BPEL (soon) scripting

Finding compound-protein relationships
A 2D structure is supplied for input into the
similarity search (in this case, the extracted
bound ligand from the PDB 1Y4 complex)
A protein implicated in tumor growth is supplied
to the docking program (in this case HSP90 taken
from the PDB 1Y4 complex)
Correlation of docking results and biological
fingerprints across the human tumor cell lines
can help identify potential mechanisms of action
of DTP compounds
The workflow employs our local NIH DTP database
service to search 200,000 compounds tested in
human tumor cellular assays for similar
structures to the ligand. Client portlets are
used to browse these structures
Once docking is complete, the user visualizes the
high-scoring docked structures in a portlet using
the JMOL applet.
Similar structures are filtered for drugability,
and are automatically passed to the OpenEye FRED
docking program for docking into the target
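
The similarity search step can be sketched as follows. The slides do not say which similarity measure the DTP database service uses; the Tanimoto coefficient on bit fingerprints is a common choice and is assumed here. The compound names, fingerprints, and threshold are made up for illustration.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two fingerprints given as sets of 'on' bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def similarity_search(query, library, threshold):
    """Return (compound id, score) pairs at or above threshold, best first."""
    scored = [(cid, tanimoto(query, fp)) for cid, fp in library.items()]
    hits = [(cid, score) for cid, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy library: hypothetical compounds with hypothetical fingerprint bits.
library = {
    "cmpd-A": frozenset({1, 4, 7, 9}),
    "cmpd-B": frozenset({1, 4, 8}),
    "cmpd-C": frozenset({2, 3, 5}),
}
query = frozenset({1, 4, 7})  # stands in for the extracted ligand

hits = similarity_search(query, library, threshold=0.5)
print(hits)
```

In the workflow described here, the hits would then go on to the drugability filter and the docking step.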
HTS data organization and flagging
A tumor cell line is selected. The activity
results for all the compounds in the DTP database
in the given range are extracted from the
PostgreSQL database
OpenEye FILTER is used to calculate biological
and chemical properties of the compounds that are
related to their potential effectiveness as drugs
The compounds are clustered on chemical structure
similarity, to group similar compounds together
The compounds along with property and cluster
information are converted to VOTable format and
displayed in VOPlot
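
A minimal sketch of the conversion step, assuming the basic VOTable element layout (VOTABLE, RESOURCE, TABLE, FIELD, DATA, TABLEDATA, TR, TD). Namespace declarations are omitted for brevity, so a real viewer may need more than this. The column names and values are made up.

```python
import xml.etree.ElementTree as ET


def to_votable(fields, rows):
    """Serialize rows as a bare-bones VOTable document (no namespaces)."""
    votable = ET.Element("VOTABLE", version="1.1")
    table = ET.SubElement(ET.SubElement(votable, "RESOURCE"), "TABLE")
    for name, datatype in fields:
        attrs = {"name": name, "datatype": datatype}
        if datatype == "char":
            attrs["arraysize"] = "*"  # variable-length string
        ET.SubElement(table, "FIELD", attrs)
    tabledata = ET.SubElement(ET.SubElement(table, "DATA"), "TABLEDATA")
    for row in rows:
        tr = ET.SubElement(tabledata, "TR")
        for value in row:
            ET.SubElement(tr, "TD").text = str(value)
    return ET.tostring(votable, encoding="unicode")


# Hypothetical compound table: structure, activity, cluster number.
fields = [("smiles", "char"), ("activity", "double"), ("cluster", "int")]
rows = [("CCO", 6.1, 1), ("CCN", 5.2, 1)]
xml_text = to_votable(fields, rows)
print(xml_text)
```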
Use Case: Which of these hits should I follow up?
  • An HTS experiment has produced 10,000 possible
    hits out of a screening set of 2 million compounds. A
    chemist on the project wants to know which
    series of compounds are most promising for
    follow-up, based on
  • Series selection → cluster analysis
  • Structure-activity relationships → modal
  • Chemical and pharmacokinetic properties
    → mitools, ChemAxon
  • Compound history → gNova / PostgreSQL
  • Patentability → BCI Markush handling software
  • Toxicity
  • Synthetic feasibility
  • requires visualization tools!

A Workflow Scenario: HTS Data Organization and Flagging
  • This workflow demonstrates how screening data can
    be flagged and organized for human analysis.
  • The compounds and data values for a particular
    screen are retrieved from the NIH DTP database
    and then are filtered to remove compounds with
    reactive groups, etc.
  • A tumor cell line is selected. The activity
    results for all the compounds in the DTP database
    in the given range are extracted from the
    PostgreSQL database
  • OpenEye FILTER is used to calculate biological
    and chemical properties of the compounds that are
    related to their potential effectiveness as drugs
  • ToxTree is used to flag the potential toxicities
    of compounds.
  • Divkmeans is used to add a column of cluster
  • Finally, the results are visualized using VOPlot
    and the 2D viewer applet.
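
The scenario above can be read as a table that gains columns at each stage. The sketch below is only a toy: the stage functions are hypothetical stand-ins for the reactive-group filter, ToxTree, and Divkmeans, and none of them calls the real tools. All rows, field names, and rules are invented for illustration.

```python
# Toy screening rows; every field and value here is made up.
ROWS = [
    {"id": "c1", "smiles": "CCO", "activity": 6.1, "reactive_group": False, "scaffold": "s1"},
    {"id": "c2", "smiles": "CC(=O)Cl", "activity": 7.3, "reactive_group": True, "scaffold": "s2"},
    {"id": "c3", "smiles": "CCN", "activity": 5.2, "reactive_group": False, "scaffold": "s1"},
    {"id": "c4", "smiles": "c1ccccc1", "activity": 6.8, "reactive_group": False, "scaffold": "s3"},
]


def select_range(rows, low, high):
    """Keep compounds whose activity falls in the given range."""
    return [r for r in rows if low <= r["activity"] <= high]


def remove_reactive(rows):
    """Stand-in for the filter that drops compounds with reactive groups."""
    return [r for r in rows if not r["reactive_group"]]


def flag_toxicity(rows, alerts):
    """Stand-in for ToxTree: adds a boolean tox_flag column."""
    return [{**r, "tox_flag": r["id"] in alerts} for r in rows]


def add_cluster_column(rows):
    """Stand-in for Divkmeans: numbers clusters by a precomputed key."""
    cluster_ids = {}
    out = []
    for r in rows:
        cid = cluster_ids.setdefault(r["scaffold"], len(cluster_ids) + 1)
        out.append({**r, "cluster": cid})
    return out


table = select_range(ROWS, 5.0, 7.0)
table = remove_reactive(table)
table = flag_toxicity(table, alerts={"c4"})
table = add_cluster_column(table)
for row in table:
    print(row["id"], row["tox_flag"], row["cluster"])
```

The resulting table, with its flag and cluster columns, is what would be handed to the visualization step.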

Web Services
Example plots of our workflow output using VOPlot
and VOTables
Fingerprint Generator: BCI Makebits
Cluster Analysis: BCI Divkmeans
NIH Database Service: PostgreSQL CHORD
Cluster Membership
Table Management: VOTables
Cluster the compounds in the NIH DTP database by
chemical structure, then choose representative
compounds from the clusters and dock them into
PDB protein files of interest
SMILES ID Cluster Data
Plot Visualizer: VOPlot
Docking Selector Script
3D Visualizer: JMOL
2D-3D: OpenEye OMEGA
Docking: OpenEye FRED
PDB Database Service
Docked Complex
MOL File
PDB Structure Box
Use Case: Are there any good ligands for my
  • A chemist is working on a project involving a
    particular protein target, and wants to know
  • Any newly published compounds which might fit the
    protein receptor site → gNova / PostgreSQL,
    PubChem search, FRED Docking
  • Any published 3D structures of the protein or of
    protein-ligand complexes → PDB search
  • Any interactions of compounds with other proteins
    → gNova / PostgreSQL, PubChem search
  • Any information published on the protein target →
    Journal text search

Use Case: Who else is working on these structures?
  • A chemist is working on a chemical series for a
    particular project and wants to know
  • If anyone publishes anything using the same or
    related compounds → PubChem search
  • Any new compounds added to the corporate
    collection which are similar or related → gNova
    CHORD / PostgreSQL
  • If any patents are submitted that might overlap
    the compounds he is working on → BCI Markush
    handling software
  • Any pharmacological or toxicological results for
    those or related compounds → gNova CHORD /
    PostgreSQL, MiToolkit
  • The results for any other projects for which
    those compounds were screened → gNova CHORD /
    PostgreSQL, PubChem search

VARUNA: Towards a Grid-based Molecular Modeling
  • A brief overview of Prof. Mookie Baik's VARUNA

Chemical Informatics in Academic Research?
  • Industrial Research: Target Oriented
  • Not bound to a specific molecular system
  • Not bound to a method
  • Not concerned with generality
  • Aware of Efficiency
  • Aware of Overall Cost
  • Aware of Toxicity
  • Concerned about Formulations
  • Cares about active MOLECULES
  • Academic Research: Concept Oriented
  • Specialized on few molecular families
  • Method Development is important
  • Obsessed with generality
  • Does not care much about efficiency
  • Cost is unimportant
  • Often can't even assess for Toxicity
  • Formulation is a minor issue
  • Cares mostly about REACTIONS, i.e., ways to GET to
    a molecule

AutoGeFF, Varuna and Workflows
  • Metalloproteins are extremely important in
    biochemical processes
  • Understanding their chemistry is difficult
  • To add value to the small molecule DBs (PubChem,
    etc.), we must somehow connect them to PDBs,
    BindMOAD, etc.
  • By extending Varuna's functionality to handling
    and storing metalloproteins, we could provide a

Automatic Generator of ForceFields (AutoGeFF)
  • Developing a service that can take ANY
  • drug-like molecule (from PubChem, for example)
  • metal complexes
  • metalloenzymes (from PDB, for example)
  • unnatural or functionalized amino acids,
    nucleobases (from in-house db)
  • for which molecular mechanics force fields are
    not available and automatically generate FFs
    based on
  • High level Quantum Simulations (using Varuna as a
    Web service)
  • for Sophisticated Molecular Mechanics
  • First Step: Coding of a specialized prototype
    that can reproduce our manually derived novel
    force fields for Cu-Aβ (Alzheimer's Disease) as a
    Proof-of-Principle Study.

Automatic Quantum Mechanical Curation of
Structure Data
  • Chemical Research logic is often driven by
    molecular structure
  • Large-scale, small molecule DBs (such as
    PubChem) have low-resolution structure data
  • Often key properties are not consistently
  • e.g. Rotation-barriers, Redox Potentials,
    Polarizabilities, IR frequencies, reactivity
    towards nucleophiles
  • QM web-services will provide tools for generating
    high-resolution data
  • that will curate the results of traditional
    ChemInfo studies
  • allow for combinatorial computational chemistry
  • access a database of modeling data

Prototype Project: Controlling the TGFβ pathway
in-house Molecules in Varuna
Conceptual Understanding of TGFβ Inhibition
Inactive TGFβ
Active TGFβ With inhibitor
  • Questions
  • What molecular feature controls inhibitor
  • How do mutations impact binding?

Experiments in the Zhang Lab
Consequences for ChemInfo Design for Academia
  • TWO Strategies are needed
  • Making traditional ChemInfo tools that are often
    available in commercial research available to
    Academia is in principle straightforward.
  • New ChemInfo Tools that are CONCEPT centered and
    include REACTIONS in addition to MOLECULES must
    be developed.
  • Our approach: Development of
  • (a) Quantum Chemical Database
  • (b) Molecular Modeling Database
  • Harness the power of recent advances in Molecular
    Modeling (QM, QM/MM, MM, MD) through information
  • Data depository for Quantum Chemical Data
    including both Properties and Mechanisms

QM Calculation Workflow
More Information
  • Contact me
  • Most of this was taken from our CICC project. See
  • Note: we've found wikis to be extremely useful and
    fun to use for maintaining collaborative web
  • See also and for other examples
    using MediaWiki.
  • Many elements of our approach are based on Prof.
    Peter Murray-Rust's group's approach.
  • WWMM Wiki
  • SourceForge Project Site

Additional Slides
Use Case - CICC: Which of these hits should I
follow up?
  • An MLI HTS experiment has produced 10,000
    possible hits out of a screening set of 2 million
    compounds. A chemist at another laboratory wants
    to know if there are any interesting active
    series she might want to pursue, based on
  • Structure-activity relationships
  • Chemical and pharmacokinetic properties
  • Compound history
  • Patentability
  • Toxicity
  • Synthetic feasibility

CICC Web Services I
  • BCI Clustering
  • Provides Barnard Chemical Information (BCI)
    clustering packages
  • A module of the workflow for HTS data
    organization and flagging
  • Status
  • Added URL output support to the previous solid
    prototype (Multi-user durable)
  • Taverna Beanshell scripting for data format
    adjustment (e.g., filtering out the header row
    listing column names)
  • To do: Evaluating the URI (URL) based workflow
  • ToxTree
  • Estimates toxic hazard by applying a decision
    tree approach
  • A module of the workflow for HTS data
    organization and flagging
  • Status: A test prototype producing the level of
    toxicity in a brief or verbose explanation
    for a SMILES structure
  • To do
  • Refining the Web service for cluster input and
    external property support
  • The Taverna Beanshell scripting for data merging
    not used in some modules
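
The data-format adjustments mentioned above (dropping the header row, merging outputs) are done with Beanshell scripts inside Taverna. The sketch below shows the same kind of logic in Python purely for illustration. The slides do not give the actual BCI output layouts, so the whitespace-separated text formats and the function names are assumptions.

```python
def strip_header(text):
    """Drop the first non-blank line (column names) and any blank lines."""
    lines = [line for line in text.splitlines() if line.strip()]
    return lines[1:]


def merge_columns(left, right):
    """Join two row lists line by line, e.g. SMILES rows with cluster ids."""
    if len(left) != len(right):
        raise ValueError("row counts differ; cannot merge")
    return [f"{a} {b}" for a, b in zip(left, right)]


# Hypothetical service outputs, each starting with a header row.
smiles_output = "SMILES ID\nCCO c1\nCCN c2\n"
cluster_output = "CLUSTER\n1\n2\n"

merged = merge_columns(strip_header(smiles_output), strip_header(cluster_output))
print(merged)
```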

CICC Web Services II
  • Workflow for HTS data organization and flagging
  • Demonstrates how screening data can be flagged
    and organized for human analysis
  • Status: Individual modules except the
    visualization are in prototype
  • To do
  • Defining at least an XML schema or DTD for the
    workflow data (at most, an ontology)
  • Redefining the current workflow model to reflect
    the new features of Taverna 1.4: support for
    complex data structures and the provenance plugin
  • Other Planned Web Services
  • Open Source Chemistry Analysis Routines (OSCAR)
  • Extracts chemical information from text and
    produces an XML instance highlighting the
    chemical information
  • A module of the PMR workflow
  • Status: OSCAR3 is available and works fine as a
    Java application
  • To do: Studying XML instances for extracting
    chemical names
  • InfoChem's SPRESI Web Service
  • Provides access to the SPRESI molecule database
  • Status: Perl scripts for accessing SPRESI Web
  • To do: Developing a Web service wrapper to
    utilize InfoChem's SPRESI Web Service

BCI Clustering URL Service Methods
Service Method              | Description                                    | Input                | URL Output
makebitsURLGenerate         | Generate fingerprints from a SMILES structure  | SMIstring            | Fingerprint and program output
divkmURLGenerate            | Cluster fingerprints with Divkmeans            | SCNstring            | DKM data and program output
smile2dkmURL                | Makebits + divkm                               | SMIstring            | All SMI, DKM and std. outputs
optclusURLGenerate          | Generate the best levels in a hierarchy        | SMIstring, DKMstring | Best data and program output
rnnclusURLGenerate          | Extract individual cluster partitions          | SMIstring, DKMstring | New partition and std. output
smile2ClusterPartitionedURL | Generate a new SMILES structure w/ extra col.  | SMIstring            | All intermediate data and output
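
To give a feel for what the clustering method does with fingerprints, here is a toy sketch that assumes Divkmeans performs a divisive (bisecting) k-means, as its name suggests. It is not BCI's implementation. The seeding rule, the choice to split the largest cluster, and the 0/1 vectors are all assumptions made for illustration.

```python
def sqdist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def two_means(points, iterations=10):
    """Split points into two groups with a small, deterministic 2-means."""
    seed_a = points[0]
    seed_b = max(points, key=lambda p: sqdist(p, seed_a))  # farthest point
    centers = [list(seed_a), list(seed_b)]
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            nearer_a = sqdist(p, centers[0]) <= sqdist(p, centers[1])
            groups[0 if nearer_a else 1].append(p)
        if not groups[0] or not groups[1]:
            break  # cannot split further (e.g. identical points)
        centers = [centroid(groups[0]), centroid(groups[1])]
    return groups


def divisive_kmeans(points, k):
    """Repeatedly bisect the largest cluster until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        biggest = clusters.pop(0)
        left, right = two_means(biggest)
        if not left or not right:
            clusters.append(biggest)
            break
        clusters.extend([left, right])
    return clusters


# Hypothetical 4-bit fingerprints for four compounds.
fingerprints = [[1, 1, 0, 0], [1, 1, 1, 0], [0, 0, 1, 1], [0, 0, 0, 1]]
clusters = divisive_kmeans(fingerprints, k=2)
print(clusters)
```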
Workflow for smile2ClusterPartitionedURL
Workflow for Toxic Hazard in Verbose
Diagram of Workflow2
Web Services
Beanshell Scripting
  • Informatics is the discipline of science which
    investigates the structure and properties (not
    specific content) of scientific information, as
    well as the regularities of scientific
    information activity, its theory, history,
    methodology and organization. The purpose of
    informatics consists in developing optimal
    methods and means of presentation (recording),
    collection, analytical-synthetic processing,
    storage, retrieval and dissemination of
    scientific information.
  • A. I. Mikhailov, A. I. Chernyi, R. S.
    Gilyarevskii (1967) Informatics -- New Name of
    the Theory of Scientific Information

Chemical informatics is
  • More usually known as chemoinformatics or
  • Very differently defined, reflecting its
    cross-disciplinary nature
  • Librarian
  • Chemist (synthetic, medicinal, theoretical)
  • Biologist / Bioinformatician
  • Molecular modeler
  • Pharmaceutical or Chemical Engineer
  • Computer Scientist / Informatician

More definitions
  • Computational Chemistry: The application of
    mathematical and computational methods,
    particularly to theoretical chemistry
  • Molecular Modeling: Using 3D graphics and
    optimization techniques to help understand the
    nature and action of compounds and proteins
  • Computer-Aided Drug Design: The discipline of
    using computational techniques (including
    chemical informatics) to assist in the discovery
    and design of drugs.

Traditional areas of application
  • Pharmaceutical and life science industry
  • particularly in early stage drug design
  • Databases of available chemicals
  • Electronic publishing
  • including searchable chemical structure
    information in journals, etc.
  • Government and patent databases

The ics so far (1960s to present)
  • How do you represent 2D and 3D chemical
  • Not just a pretty picture
  • How do you search databases of chemical
  • Google doesn't help (much, but it might do soon)
  • How do you organize large amounts of chemical
  • How do you visualize chemical structures
  • Can computers predict how chemicals are going to
  • in the test tube?
  • in the body?

Current trends and hot topics
  • The decorporatization of chemical informatics
    (PubChem, MLI, eScience, open source)
  • Service-oriented architectures
  • Packaging and processing large volumes of complex
    information for human consumption
  • Integration with other ics (bioinformatics,
    genomics, proteomics, systems biology)

Main players (Commercial)
  • MDL
  • Tripos, Inc.
  • Accelrys
  • Daylight CIS, Inc.

Main players (Academia)
  • Pure Chemoinformatics
  • University of Sheffield, UK (Willett / Gillet)
  • Erlangen, Germany (Gasteiger)
  • Cambridge Unilever Center
  • Indiana University School of Informatics
  • Related (computational chemistry, etc.)
  • UCSF (Kuntz)
  • University of Texas (Pearlman)
  • Yale (Jorgensen)
  • University of Michigan (Crippen)

Traditional Journals
  • Journal of Chemical Information and Modeling
    (formerly JCICS)
  • Journal of Computer-Aided Molecular Design
  • Journal of Molecular Graphics and Modeling
  • Journal of Computational Chemistry
  • Journal of Chemical Theory and Computation
  • Journal of Medicinal Chemistry

Informal publications
  • Network Science (online)
  • Chemical & Engineering News
  • Drug Discovery Today
  • Scientific Computing World
  • Bio-IT World

CINF-L Distribution List
  • Chemical Information Sources Discussion List
  • Created by Gary Wiggins at IUB

Yahoo! Chemoinformatics Discussion List
  • For
  • Job postings
  • Ideas exchange
  • Questions
  • Industry and student connections
  • All students encouraged to join
  • Open to others

To join, go to http//
inf Or send an email to chemoinf-subscribe_at_yahoogr
Open Source / Free Software
  • Blue Obelisk
  • InChI
  • JMOL
  • FROWNS
  • OpenBabel
  • CML
  • CDK
  • MMTK

Example 2: 3D Visualization and Docking
  • 3D Visualization of interactions between
    compounds and proteins
  • Docking compounds into proteins

3D Visualization
  • X-ray crystallography and NMR Spectroscopy can
    reveal 3D structure of protein and bound