Smart mining of drug discovery information using web service workflows

1
Smart mining of drug discovery information using
web service workflows
  • David Wild
  • Assistant Professor of Chemical Informatics
  • Indiana University School of Informatics,
    Bloomington
  • djwild _at_ indiana.edu
  • March 2006

2
About Me
  • B.Sc. in Computing Science
  • Ph.D. 1994 in the Willett group at the University of
    Sheffield, UK: GAs and parallel processors for
    3D field similarity searching
  • Postdocs at Sheffield and Parke-Davis (Ann Arbor)
  • Senior Scientist at Parke-Davis / Pfizer
    Scientific Computing R&D group
  • 2002: Started scientific computing company,
    adjunct professorship in Pharm. Eng. at Michigan
  • 2004: Joined SOI in part-time visiting position

3
Overview
  • Chemical Informatics: what and why
  • Challenges of diverse sources and large volumes
    of information
  • Our research into using web services,
    workflows and smart agents
  • Careers in chemical informatics

4
Chemical informatics is
  • More usually known as chemoinformatics or
    cheminformatics
  • Very differently defined, reflecting its
    cross-disciplinary nature
  • Librarian
  • Chemist (synthetic, medicinal, theoretical)
  • Biologist / Bioinformatician
  • Molecular modeler
  • Pharmaceutical or Chemical Engineer
  • Computer Scientist / Informatician

5
My definition of Chemical Informatics
  • Chemical Informatics (a.k.a. chemoinformatics) is
    the branch of informatics dealing with all
    aspects of the representation and use of chemical
    structures, and related information, on computer.
  • It is an interdisciplinary field that
    regularly pushes the boundaries of computer
    science, statistics, visualization methods,
    computing power and scientific technique. The
    subject covers a wide variety of applications and
    specialties, particularly in the pharmaceutical
    industry, where the rapid increase in new
    technologies in drug discovery puts chemical
    informatics at the forefront of drug design. It
    is foundational to such diverse applications as
    3D molecular modeling, artificial intelligence
    biological activity prediction methods, patent
    and chemical database searching, and high
    throughput screening data analysis.

6
More definitions
  • Computational Chemistry: The application of
    mathematical and computational methods,
    particularly to theoretical chemistry
  • Molecular Modeling: Using 3D graphics and
    optimization techniques to help understand the
    nature and action of compounds and proteins
  • Computer-Aided Drug Design: The discipline of
    using computational techniques (including
    chemical informatics) to assist in the discovery
    and design of drugs.

7
Chemoinformatics hits on Google
  • Dec 2005: 348,100
  • April 2005: 125,600
  • July 2000: 723

Number of word occurrences on Google, taken from
http://www.molinspiration.com/chemoinformatics.html
8
Virtual screening: predicting drug activity
Virtual chemistry: learning about the way
compounds work
Analyzing and navigating large volumes of
chemical & biological information
9
Example 1: High-Throughput Screening
Testing perhaps millions of compounds in a
corporate collection to see if any show activity
against a certain disease protein
10
High-Throughput Screening
  • Traditionally, small numbers of compounds were
    tested for a particular project or therapeutic
    area
  • About 10 years ago, technology developed that
    enabled large numbers of compounds to be assayed
    quickly
  • High-throughput screening can now test 100,000
    compounds a day for activity against a protein
    target
  • Maybe tens of thousands of these compounds will
    show some activity for the protein
  • The chemist needs to intelligently select the 2-3
    classes of compounds that show the most promise
    as drugs for follow-up

11
Informatics Implications
  • Need to be able to store chemical structure and
    biological data for millions of data points
  • Computational representation of 2D structure
  • Need to be able to organize thousands of active
    compounds into meaningful groups
  • Group similar structures together and relate to
    activity (cluster analysis)
  • Need to learn as much information as
    possible (data mining)
  • Apply statistical methods to the structures and
    related information
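
One common way to make "group similar structures together" concrete is to represent each 2D structure as a set of fingerprint bits and compare pairs with the Tanimoto coefficient. The sketch below is illustrative only: the slides do not say which fingerprint or similarity measure was used, and the compound names and bit sets are made up.

```python
from typing import Dict, FrozenSet, List, Tuple

Fingerprint = FrozenSet[int]  # indices of the "on" bits in a 2D fingerprint


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto coefficient: shared bits divided by bits set in either."""
    if not a and not b:
        return 1.0  # two empty fingerprints are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical fingerprints; real ones come from a cheminformatics toolkit.
fingerprints: Dict[str, Fingerprint] = {
    "cmpd_A": frozenset({1, 4, 7, 9}),
    "cmpd_B": frozenset({1, 4, 7, 12}),
    "cmpd_C": frozenset({20, 21, 22}),
}


def most_similar(query: str, library: Dict[str, Fingerprint]) -> List[Tuple[float, str]]:
    """Rank every other compound in the library by similarity to the query."""
    scored = [
        (tanimoto(library[query], fp), name)
        for name, fp in library.items()
        if name != query
    ]
    return sorted(scored, reverse=True)


print(most_similar("cmpd_A", fingerprints))
```

A pairwise measure like this is the input to the clustering methods discussed in the supplemental slides.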

12
Tools for mining the data
Tripos Benchware HTS Dataminer (formerly SAR
Navigator), www.tripos.com
13
Example 2: 3D Visualization & Docking
  • 3D Visualization of interactions between
    compounds and proteins
  • Docking compounds into proteins
    computationally

14
3D Visualization
  • X-ray crystallography and NMR Spectroscopy can
    reveal 3D structure of protein and bound
    compounds
  • Visualization of these complexes of proteins
    and potential drugs can help scientists
    understand the mechanism of action of the drug
    and improve its design
  • Visualization uses computational ball and stick
    model of atoms and bonds, as well as surfaces
  • Stereoscopic visualization available

15
Visualization Demos (JMOL)
  • jmol.sourceforge.net
  • www-mslmb.niddk.nih.gov/prag/structures.html

16
Docking algorithms
  • Require 3D atomic structure for protein, and 3D
    structure for compound (ligand)
  • May require initial rough positioning for the
    ligand
  • Will use an optimization method to try to find
    the best rotation and translation of the ligand
    in the protein, for optimal binding affinity

17
Genetic Algorithms
  • Create a population of possible solutions,
    encoded as chromosomes
  • Use fitness function to score solutions
  • Good solutions are combined together
    (crossover) and altered (mutation) to provide
    new solutions
  • The process repeats until the population
    converges on a solution
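
The loop described above can be sketched in a few lines. This is a toy, not a docking program: the chromosome is a hypothetical ligand pose (three translations plus one rotation angle), and the fitness function simply rewards closeness to a made-up target pose, standing in for a real binding-affinity score. It runs for a fixed number of generations rather than testing for convergence.

```python
import random

# Hypothetical "best" pose (tx, ty, tz, angle in degrees); a real docking
# program would score protein-ligand interactions instead.
TARGET_POSE = (1.0, -2.0, 0.5, 90.0)
MUTATION_SIGMA = (0.5, 0.5, 0.5, 10.0)  # assumed per-gene mutation step sizes


def fitness(pose):
    """Higher is better: negative squared distance to the target pose."""
    return -sum((p - t) ** 2 for p, t in zip(pose, TARGET_POSE))


def random_pose(rng):
    return (rng.uniform(-5, 5), rng.uniform(-5, 5),
            rng.uniform(-5, 5), rng.uniform(0, 360))


def crossover(a, b, rng):
    """Single-point crossover: head of one parent, tail of the other."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:]


def mutate(pose, rng, rate=0.2):
    """Perturb each gene with probability `rate`."""
    return tuple(
        g + rng.gauss(0, s) if rng.random() < rate else g
        for g, s in zip(pose, MUTATION_SIGMA)
    )


def run_ga(pop_size=40, generations=60, seed=0):
    rng = random.Random(seed)
    population = [random_pose(rng) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # keep the fitter half unchanged
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=fitness)


best = run_ga()
print(best, fitness(best))
```

Because the fitter half of each generation is carried over unchanged, the best score never gets worse from one generation to the next.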

18
Sample GOLD output
  • GMP into RNaseT1

19
Something fun
  • Screensaver that docks molecules while your
    computer is idle, at
  • http://www.grid.org/projects/cancer/

20
Chemical Informatics Tools
  • Databases: chemical structure, biological
    activity, properties, genomic
  • Computation: docking, pharmacophore generation,
    property calculation, energy minimization,
    cluster analysis, format conversion, alignment,
    2D & 3D predictive models (QSAR)
  • Visualization & analysis: plotting, spreadsheet
    views, nonlinear maps, Kohonen maps, 3D molecular
    visualization, structure entry
  • Mix of sources: commercial vendors, open source,
    academic code
  • Very complex to use together

21
Vast increase in quantity of information and
number of data sources
  • Until the last few years, the main challenges in
    chemical informatics were about quality, but now
    it's about quality, quantity and managing all the
    information effectively
  • Traditionally an issue for the pharmaceutical
    industry but now an issue for academia too (NIH
    Roadmap, PubChem)
  • High Throughput Screening can produce
    biological data points for 100,000 compounds per
    day
  • Combinatorial Chemistry: a single experiment can
    create thousands of new chemicals
  • Microarray Assays can produce expression
    information on, e.g., 14,000 genes for a
    drug-treated tissue sample
  • Computational techniques can dock hundreds of
    thousands of compounds into proteins or calculate
    millions of properties in a day
  • Meta information: go/no-go decisions, series
    decisions, patents, etc.

22
Observations about the problem
  • Existing approaches do not scale up
  • Scientists' questions are not that complex, but
    finding the answers is currently very time
    consuming and/or complex (for a human)
  • Has anybody patented this chemical structure I
    just made?
  • Can I get hold of a compound that might bind to
    the active site of this protein I just resolved?
  • Which compounds in this series are least likely
    to exhibit toxic effects?
  • Answers are often stale after a short period of
    time: questions need to be re-answered as new
    information is generated
  • Almost all available systems are passive, and
    follow the (web) browsing model
  • There tends to be one interface for every data
    source (or encompassing just a few)

23
4 categories, 72 advertisements, 1,000 words × 50
newspapers, accessed in different ways?
24
"However large an array of facts, however rapidly
they accumulate, it is possible to keep them in
order and to extract from time to time digests
containing the most generally significant
information, while indicating how to find those items
of specialized interest. To do so, however,
requires the will and the means ... we need to
get the best information in the minimum
quantity in the shortest time, from the people
who are producing the information to the people
who want it, whether they know they want it or
not." J.D. Bernal, quoted in Murray-Rust et al.,
Org. Biomol. Chem., 2004, 2, 3192-3203
25
The aim
  • An open-source prototype that implements a new
    model of data mining that would, on request,
    push relevant information to pharmaceutical
    scientists in response to previously-defined
    straightforward expressions of needs, rather than
    relying on them stumbling upon the right
    information using traditional browsing models.
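
The "push" model in this aim can be pictured as standing queries: a scientist registers an expression of need once, and every new record is checked against all registered needs as it arrives. The sketch below is only one possible reading of that idea. The class names, record fields and example query are hypothetical and do not come from the prototype.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class StandingQuery:
    owner: str                      # who should receive matching records
    description: str                # the need, in plain words
    matches: Callable[[dict], bool]  # predicate applied to each new record


@dataclass
class PushService:
    queries: List[StandingQuery] = field(default_factory=list)
    inbox: Dict[str, List[dict]] = field(default_factory=dict)

    def register(self, query: StandingQuery) -> None:
        """Store a previously-defined expression of need."""
        self.queries.append(query)

    def ingest(self, record: dict) -> None:
        """Check a newly arrived record against every standing query."""
        for query in self.queries:
            if query.matches(record):
                self.inbox.setdefault(query.owner, []).append(record)


service = PushService()
service.register(StandingQuery(
    owner="chemist_1",
    description="new patents mentioning my compound series",
    matches=lambda r: r.get("type") == "patent" and r.get("series") == "S-17",
))
service.ingest({"type": "patent", "series": "S-17", "id": "rec-001"})
service.ingest({"type": "assay", "series": "S-17", "id": "rec-002"})
print(service.inbox)
```

The scientist never browses for the record; it is delivered because the need was stated in advance.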

28
3-layer model
29
Web Services
  • Semantic Web: Next Big Thing
  • Encode semantics in web pages (XML)
  • Describes services as well as information (SOAP,
    WSDL, UDDI)
  • Computation detached from interface
  • Note: seeping through to general web usage
  • http://www.google.com/apis/
  • http://www.amazon.com/webservices
  • eScience (UK)
  • £200m over 2001-2006 period
  • http://www.rcuk.ac.uk/escience/
  • Cyber Infrastructure / Grid (US)

30
Request from Human Interface
AGENT / SMART CLIENT
  • Parse request
  • Select appropriate use cases and/or web
    service(s)
  • Schedule as necessary
USE-CASE SCRIPT
  • Invoke New Structure Service
  • Convert structures to 3D
  • Dock results into protein file
  • Extract any hits
  • Return links for visualization
Service discovery, description and messaging
  • UDDI (?)
  • WSDL
  • SOAP
Aggregate services
  • New Structure Service
  • Search online databases for recent structures
  • Search local databases for recent structures
  • Merge results
Atomic services
  • Online database (e.g. PubChem)
  • Local database
  • 3D Docking Tool
  • 2D-3D converter
  • 3D visualizer
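
As a rough illustration of how the use-case script on this slide composes atomic and aggregate services, the sketch below wires the steps together with local stub functions. Everything here is assumed for illustration: the function names, record fields, toy docking score and link format are hypothetical, and in the real system each stub would be a remote web service call.

```python
from typing import Dict, List

Record = Dict[str, str]


# --- Atomic services (stubs standing in for remote web services) ---
def search_online_db(since: str) -> List[Record]:
    return [{"id": "online-1", "smiles": "CCO"}]


def search_local_db(since: str) -> List[Record]:
    return [{"id": "local-1", "smiles": "c1ccccc1"},
            {"id": "online-1", "smiles": "CCO"}]  # duplicate of the online hit


def convert_to_3d(record: Record) -> Record:
    return {**record, "coords": "<3D coordinates placeholder>"}


def dock(record: Record, protein_file: str) -> float:
    # Toy score only: a real docking tool would estimate binding affinity.
    return float(len(record["smiles"]))


# --- Aggregate service: search both sources and merge the results ---
def new_structure_service(since: str) -> List[Record]:
    merged: Dict[str, Record] = {}
    for rec in search_online_db(since) + search_local_db(since):
        merged.setdefault(rec["id"], rec)  # keep first copy of each structure
    return list(merged.values())


# --- Use-case script: the sequence of steps listed on the slide ---
def use_case_script(protein_file: str, since: str, score_cutoff: float) -> List[str]:
    links = []
    for rec in new_structure_service(since):
        rec3d = convert_to_3d(rec)
        if dock(rec3d, protein_file) >= score_cutoff:  # extract hits
            links.append(f"viewer?structure={rec['id']}&protein={protein_file}")
    return links


print(use_case_script("target.pdb", "2006-03-01", 5.0))
```

The agent layer would sit above this, choosing which script to run for a given request.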
31
Prototype development plan
  • Develop a handful of use-cases based around
    industry/academia scientists
  • Build 5-6 data / computation sources (e.g.
    enumeration, property calculation, structure
    database) that can fulfill the use cases
  • Build WSDL and SOAP web services around the data
    sources that can be accessed from Taverna
  • Develop workflows in Taverna (see
    taverna.sourceforge.net)
  • Publish web services in UDDI
  • Encode use-cases into scripts
  • Build Intelligent Agent / Smart Client node that
    can match user needs with scripts and web
    services using workflows
  • Develop browser interface through Contextual
    Inquiry/Usability Studies
  • Consider mapping to a Natural Language Interface

33
Technology
  • Perl SOAP::Lite
  • Will be used for initial web service development
  • Doesn't really implement WSDL & UDDI
  • Apache Axis & Tomcat
  • Deploy WSDL for web services
  • BPEL4WS: Business Process Execution Language
  • For aggregation of web services
  • http://www-128.ibm.com/developerworks/library/specification/ws-bpel/
  • Microsoft .NET & C#
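
For readers unfamiliar with SOAP, the sketch below shows roughly what a request message to one of these services looks like on the wire. It uses Python's standard library rather than any of the toolkits listed above. The SOAP 1.1 envelope namespace is the standard one, but the service namespace, operation name and parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:chem-services"                # hypothetical service namespace


def build_soap_request(operation: str, params: dict) -> str:
    """Wrap an operation call and its parameters in a SOAP envelope."""
    ET.register_namespace("soapenv", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}{operation}")
    for name, value in params.items():
        ET.SubElement(call, f"{{{SERVICE_NS}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="unicode")


# Hypothetical similarity-search call
print(build_soap_request("SimilaritySearch", {"smiles": "CCO", "cutoff": 0.7}))
```

A WSDL file describes which operations and parameters such a message may contain, so that clients such as Taverna can generate these calls automatically.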

34
  • A 2D structure is supplied for input into the
    similarity search (in this case, the extracted
    bound ligand from the PDB 1Y4 complex)
  • A protein implicated in tumor growth is supplied
    to the docking program (in this case HSP90 taken
    from the PDB 1Y4 complex)
  • Correlation of docking results and biological
    fingerprints across the human tumor cell lines
    can help identify potential mechanisms of action
    of DTP compounds
  • The workflow employs our local NIH DTP database
    service to search 200,000 compounds tested in
    human tumor cellular assays for similar
    structures to the ligand. Client portlets are
    used to browse these structures
  • Once docking is complete, the user visualizes the
    high-scoring docked structures in a portlet using
    the JMOL applet.
  • Similar structures are filtered for drugability,
    and are automatically passed to the OpenEye FRED
    docking program for docking into the target
    protein.
35
  • Search hits are docked into acetylcholinesterase
    and visualized using JMOL
  • Hits returned from similarity search on Donepezil
    in NIH DTP database using web service or SQL link
  • Information retrieved from PubChem using Expert
    Query
  • Expert Query can be used to retrieve related
    genomic information from the web
36
Funding at Indiana
  • $500,000 NIH grant to develop an exploratory
    Chemical Informatics Cyberinfrastructure
    Collaboratory over two years. Partnership with
    Informatics and Community Grids Lab. Developing
    web services and workflows particularly relating
    to PubChem and HTS analysis. May lead to $10m
    funding for full center. See www.chembiogrid.org.
  • $49,000 Microsoft Research Smart Clients for
    eScience grant for development of this system
  • For more information, see
  • http://www.informatics.indiana.edu/djwild

37
PubChem
  • pubchem.ncbi.nlm.nih.gov
  • Currently contains 10,096,336 chemical structures
    (3/30/06) and growing rapidly
  • Likely to be a (or the) major worldwide source of
    chemical information unless litigation
    restricts it (CAS)
  • Includes the latest InChI representations
  • Can be queried on chemical structure, exposed as
    web services (SOAP)
  • Linked in with MLI project and biological data
    from HTS experiments

38
Supplemental slides
39
Cluster Analysis and Chemical Informatics
  • Used for organizing datasets into chemical
    series, to build predictive models, or to select
    representative compounds
  • Organizational usage has not been as well studied
    as the other two, but see
  • Wild, D.J., Blankley, C.J. Comparison of 2D
    Fingerprint Types and Hierarchy Level Selection
    Methods for Structural Grouping using Ward's
    Clustering, Journal of Chemical Information and
    Computer Sciences, 2000, 40, 155-162.
  • Essentially helping large datasets become
    manageable
  • Methods used
  • Jarvis-Patrick and variants
  • O(N^2), single partition
  • Ward's method
  • Hierarchical, regarded as best, but at least
    O(N^2)
  • K-means
  • < O(N^2), requires a set number of clusters, a
    little messy
  • Sphere-exclusion (Butina)
  • Fast, simple, similar to JP
  • Kohonen network
  • Clusters arranged in 2D grid, ideal for
    visualization
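
Of the methods listed, sphere exclusion is the simplest to show in code. The sketch below follows the commonly described form of the algorithm: count each compound's neighbours above a similarity threshold, then repeatedly take the unassigned compound with the most neighbours as a cluster centre and claim its unassigned neighbours. The fingerprints and the 0.5 threshold are made up, and the pairwise neighbour search is itself O(N^2), so this is a teaching example rather than something for large datasets.

```python
from typing import FrozenSet, List

Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def sphere_exclusion(fps: List[Fingerprint], threshold: float = 0.5) -> List[List[int]]:
    """Return clusters as lists of indices into `fps`, centre first."""
    n = len(fps)
    # Neighbour list for every compound (all pairs compared).
    neighbours = [
        {j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= threshold}
        for i in range(n)
    ]
    # Visit compounds with the most neighbours first.
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        members = [i] + sorted(j for j in neighbours[i] if j not in assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


# Hypothetical fingerprints: two obvious structural families.
fps = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({1, 2, 3, 6}),
    frozenset({10, 11, 12}),
    frozenset({10, 11, 13}),
]
print(sphere_exclusion(fps))
```

Each compound ends up in exactly one cluster, which is what "single partition" means for the non-hierarchical methods above.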

40
Limitations of Ward's method for large datasets
(>1m)
  • Best algorithms have O(N^2) time requirement (RNN)
  • Requires random access to fingerprints
  • hence substantial memory requirements (O(N))
  • Problem of selection of best partition
  • can select desired number of clusters
  • Easily hit 4GB memory addressing limit on 32 bit
    machines
  • Approximately 2m compounds
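
The last two figures on this slide can be related by a back-of-envelope calculation. This is an inference from the slide, not a number given in it. It assumes the 4 GB limit means 2^32 bytes and that the whole address space is spent on per-compound data:

```latex
\frac{2^{32}\ \text{bytes}}{2 \times 10^{6}\ \text{compounds}}
  = \frac{4\,294\,967\,296}{2\,000\,000}
  \approx 2.1 \times 10^{3}\ \text{bytes per compound}
```

That is, a budget of roughly 2 KB per compound for its fingerprint and any clustering bookkeeping.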

41
Scaling up clustering methods
  • Parallelisation
  • Clustering algorithms can be adapted for multiple
    processors
  • Some algorithms more appropriate than others for
    particular architectures
  • Ward's has been parallelized for shared memory
    machines, but overhead considerable
  • New methods and algorithms
  • Divisive (bisecting) K-means method
  • Hierarchical Divisive
  • Approx. O(N log N)

42
Divisive K-means Clustering
  • New hierarchical divisive method
  • Hierarchy built from top down, instead of bottom
    up
  • Divide complete dataset into two clusters
  • Continue dividing until all items are singletons
  • Each binary division done using K-means method
  • Originally proposed for document clustering
  • Bisecting K-means
  • Steinbach, Karypis and Kumar (Univ. Minnesota)
    http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/doccluster.pdf
  • Found to be more effective than agglomerative
    methods
  • Forms more uniformly-sized clusters at given
    level
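
The top-down procedure on this slide can be sketched in plain Python. This is a minimal illustration of bisecting K-means, not the BCI Divkmeans implementation. It works on made-up 2D points rather than fingerprints, always divides the largest remaining cluster, and stops at a requested number of clusters instead of continuing down to singletons.

```python
import random


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def two_means(points, rng, max_iter=20):
    """One binary division: K-means with K = 2 on `points` (needs >= 2 points)."""
    c0, c1 = rng.sample(points, 2)  # random seeds
    left, right = list(points), []
    for _ in range(max_iter):
        left = [p for p in points if sqdist(p, c0) <= sqdist(p, c1)]
        right = [p for p in points if sqdist(p, c0) > sqdist(p, c1)]
        if not left or not right:
            break
        new0, new1 = mean(left), mean(right)
        if (new0, new1) == (c0, c1):  # centroids stopped moving
            break
        c0, c1 = new0, new1
    if not left or not right:  # degenerate case, e.g. duplicate points
        half = len(points) // 2
        return list(points[:half]), list(points[half:])
    return left, right


def bisecting_kmeans(points, n_clusters, seed=0):
    rng = random.Random(seed)
    clusters = [list(points)]  # start with the complete dataset
    while len(clusters) < n_clusters:
        clusters.sort(key=len)
        biggest = clusters.pop()  # divide the largest cluster next
        if len(biggest) < 2:
            clusters.append(biggest)
            break
        clusters.extend(two_means(biggest, rng))
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(bisecting_kmeans(points, 2))
```

Recording each division as it happens, rather than only the final clusters, would give the full top-down hierarchy.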

43
BCI Divkmeans
  • Several options for detailed operation
  • Selection of next cluster for division
  • size, variance, diameter
  • affects selection of partitions from hierarchy,
    not shape of hierarchy
  • Options within each K-means division step
  • distance measure
  • choice of seeds
  • batch-mode or continuous update of centroids
  • termination criterion
  • Have developed parallel version for Linux
    clusters / grids in conjunction with BCI
  • For more information, see Barnard and Engels'
    talks at http://cisrg.shef.ac.uk/shef2004/conference.htm

44
Comparative execution times: NCI subsets, 2.2 GHz
Intel Celeron processor
7h 27m
3h 06m
2h 25m
44m
45
Clustering a 1 million compound dataset on a 2.2
GHz Celeron Desktop Machine
Results from AVIDD clusters & Teragrid coming
soon.
Time for a single run may vary due to
different selection of seeds. Runtimes can be
shortened e.g. by using a max. number of
iterations or a relocation cutoff.
46
Divisive K-means: Conclusions
  • Much faster than Ward's, speed comparable to
    K-means, suitable for very large datasets
    (millions)
  • Time requirements approximately O(N log N)
  • Current implementation can cluster 1m compounds
    in under a week on a low-power desktop PC
  • Cluster 1m compounds in a few hours with a 4-node
    parallel Linux cluster
  • Better balance of cluster sizes than Ward's or
    K-means
  • Visual inspection of clusters suggests better
    assembly of compound series than other methods
  • Better clustering of actives together than
    previously-studied methods
  • Memory requirements minimal
  • Experiments using AVIDD cluster and Teragrid
    forthcoming (50 nodes)
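
To give a feel for why the difference between O(N^2) and O(N log N) matters at this scale, here is the arithmetic for N = 1 million compounds. It ignores constant factors and takes the logarithm to base 2, so it indicates an order of magnitude, not a predicted runtime:

```latex
N = 10^{6}: \qquad
N^{2} = 10^{12}, \qquad
N \log_{2} N \approx 10^{6} \times 19.9 \approx 2 \times 10^{7}, \qquad
\frac{N^{2}}{N \log_{2} N} \approx 5 \times 10^{4}
```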

47
Visualization & interface level tools
  • No matter how clever the smarts underneath, the
    overriding factor in usefulness will be the
    quality of scientists' interaction with the
    system
  • Contextual Design, Interaction Design (Cooper)
    and Usability Studies have proven effective in
    designing the right interfaces for the right
    people in chemical informatics: collaboration
    with HCI?
  • Possibility of multiple interfaces for different
    people groups (Cooper's primary personas)
  • Don't assume the browser interface: email / NLP?
  • Start with the basics
  • 2D chemical structure drawing (input)
  • Visualization of large numbers of chemical
    structures in 2D
  • 3D chemical structure visualization
  • Planning on evaluation of NLP, email, RSS, etc.
    as well as browser-based interfaces

48
Usability of 2D structure drawing tools
  • Key difference between sequential and random
    drawers
  • Huge difference in intuitiveness
  • Key factor: how badly you can mess things up
  • Marvin Sketch JME > ChemDraw >> ISIS Draw

49
Visualization methods for datasets & clusters
  • Partitions
  • Spreadsheets
  • Enhanced Spreadsheets
  • 2D or 3D plots
  • Hierarchies
  • Dendrograms
  • Tree Maps
  • Hyperbolic Maps

52
VisualiSAR: with a nod to Edward Tufte. See
http://www.daylight.com/meetings/mug99/Wild/Mug99.html
53
Tree Maps: very Tufte-esque
54
3D Visualization - JMOL
  • Open Source, very flexible, works in a web
    service environment: jmol.sourceforge.net

55
Sentient - an alternative approach to managing
heterogeneous data sources
  • Collaboration with IO-Informatics (along with
    Cornell, and UCSD) for the investigation of
    service-oriented architectures in life sciences
    research using Sentient software
  • Aim to integrate several sources of information
    relating to Alzheimer's Disease (brain imaging,
    morphology, gene expression) so that
    cross-dataset biomarkers can be identified
  • Sentient uses Intelligent Multidimensional
    Objects (IMOs) to define and query data sources
    and the tools used to access them
  • Still a browsing approach, but with a layer of
    coherence and intelligence
  • Hope to expand to include chemistry data
  • Can also be used as an interface-level tool
