Title: Smart mining of drug discovery information using web service workflows
1. Smart mining of drug discovery information using web service workflows
- David Wild
- Assistant Professor of Chemical Informatics
- Indiana University School of Informatics, Bloomington
- djwild _at_ indiana.edu
- March 2006
2. About Me
- B.Sc. in Computing Science
- Ph.D. 1994 in Willett group at University of Sheffield, UK: GAs and parallel processors for 3D field similarity searching
- Postdocs at Sheffield and Parke-Davis (Ann Arbor)
- Senior Scientist at Parke-Davis / Pfizer Scientific Computing R&D group
- 2002: Started scientific computing company, adjunct professorship in Pharm. Eng. at Michigan
- 2004: Joined SOI in part-time visiting position
3. Overview
- Chemical Informatics: what and why
- Challenges of diverse sources and large volumes of information
- Our research into using web services, workflows and smart agents
- Careers in chemical informatics
4. Chemical informatics is
- More usually known as chemoinformatics or cheminformatics
- Very differently defined, reflecting its cross-disciplinary nature
- Librarian
- Chemist (synthetic, medicinal, theoretical)
- Biologist / Bioinformatician
- Molecular modeler
- Pharmaceutical or Chemical Engineer
- Computer Scientist / Informatician
5. My definition of Chemical Informatics
- Chemical Informatics (a.k.a. chemoinformatics) is the branch of informatics dealing with all aspects of the representation and use of chemical structures, and related information, on computer.
- It is an interdisciplinary field that regularly pushes the boundaries of computer science, statistics, visualization methods, computing power and scientific technique. The subject covers a wide variety of applications and specialties, particularly in the pharmaceutical industry, where the rapid increase in new technologies in drug discovery puts chemical informatics at the forefront of drug design. It is foundational to such diverse applications as 3D molecular modeling, artificial intelligence biological activity prediction methods, patent and chemical database searching, and high throughput screening data analysis.
6. More definitions
- Computational Chemistry: The application of mathematical and computational methods, particularly to theoretical chemistry
- Molecular Modeling: Using 3D graphics and optimization techniques to help understand the nature and action of compounds and proteins
- Computer-Aided Drug Design: The discipline of using computational techniques (including chemical informatics) to assist in the discovery and design of drugs.
7. Chemoinformatics hits on Google
- Dec 2005: 348,100
- April 2005: 125,600
- July 2000: 723
Number of word occurrences on Google. Taken from http://www.molinspiration.com/chemoinformatics.html
8. Virtual screening: predicting drug activity
Virtual chemistry: learning about the way compounds work
Analyzing and navigating large volumes of chemical & biological information
9. Example 1: High-Throughput Screening
Testing perhaps millions of compounds in a corporate collection to see if any show activity against a certain disease protein
10. High-Throughput Screening
- Traditionally, small numbers of compounds were tested for a particular project or therapeutic area
- About 10 years ago, technology was developed that enabled large numbers of compounds to be assayed quickly
- High-throughput screening can now test 100,000 compounds a day for activity against a protein target
- Maybe tens of thousands of these compounds will show some activity for the protein
- The chemist needs to intelligently select the 2-3 classes of compounds that show the most promise for being drugs to follow up
11. Informatics Implications
- Need to be able to store chemical structure and biological data for millions of data points
  - Computational representation of 2D structure
- Need to be able to organize thousands of active compounds into meaningful groups
  - Group similar structures together and relate to activity: cluster analysis
- Need to learn as much information as possible (data mining)
  - Apply statistical methods to the structures and related information
12. Tools for mining the data
Tripos Benchware HTS Dataminer (formerly SAR Navigator), www.tripos.com
13. Example 2: 3D Visualization & Docking
- 3D Visualization of interactions between compounds and proteins
- Docking compounds into proteins computationally
14. 3D Visualization
- X-ray crystallography and NMR Spectroscopy can reveal 3D structure of protein and bound compounds
- Visualization of these complexes of proteins and potential drugs can help scientists understand the mechanism of action of the drug and improve the design of a drug
- Visualization uses computational ball and stick model of atoms and bonds, as well as surfaces
- Stereoscopic visualization available
15. Visualization Demos (JMOL)
- jmol.sourceforge.net
- www-mslmb.niddk.nih.gov/prag/structures.html
16. Docking algorithms
- Require 3D atomic structure for protein, and 3D structure for compound (ligand)
- May require initial rough positioning for the ligand
- Will use an optimization method to try to find the best rotation and translation of the ligand in the protein, for optimal binding affinity
17. Genetic Algorithms
- Create a population of possible solutions, encoded as chromosomes
- Use fitness function to score solutions
- Good solutions are combined together (crossover) and altered (mutation) to provide new solutions
- The process repeats until the population converges on a solution
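The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used by any docking program: the bit-string chromosome, the `one_max` fitness function (standing in for a docking score), the population size, the mutation rate and the fixed number of generations are all assumptions chosen to keep the example small.

```python
import random

def one_max(chromosome):
    """Toy fitness function: count of 1 bits (a stand-in for a docking score)."""
    return sum(chromosome)

def crossover(a, b, rng):
    """Combine two parents with single-point crossover."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if rng.random() < rate else g for g in chromosome]

def run_ga(n_bits=20, pop_size=30, generations=60, rate=0.02, seed=0):
    rng = random.Random(seed)
    # Create a population of possible solutions, encoded as chromosomes.
    population = [[rng.randint(0, 1) for _ in range(n_bits)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        # Score solutions and keep the better half as parents.
        population.sort(key=one_max, reverse=True)
        history.append(one_max(population[0]))
        parents = population[: pop_size // 2]
        children = [population[0]]  # elitism: carry the best solution over
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            children.append(mutate(crossover(a, b, rng), rate, rng))
        population = children
    return max(population, key=one_max), history

best, history = run_ga()
print(one_max(best), history[:5])
```

For simplicity the loop runs for a fixed number of generations rather than testing for convergence; because the best chromosome is always carried over, the best score never decreases from one generation to the next.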
18. Sample GOLD output
19. Something fun
- Screensaver that docks molecules while your computer is idle, at http://www.grid.org/projects/cancer/
20. Chemical Informatics Tools
- Databases: chemical structure, biological activity, properties, genomic
- Computation: docking, pharmacophore generation, property calculation, energy minimization, cluster analysis, format conversion, alignment, 2D & 3D predictive models (QSAR)
- Visualization & analysis: plotting, spreadsheet views, nonlinear maps, Kohonen maps, 3D molecular visualization, structure entry
- Mix of sources: commercial vendors, open source, academic code
- Very complex to use together
21. Vast increase in quantity of information and number of data sources
- Until the last few years, the main challenges in chemical informatics were about quality, but now it's about quality, quantity and managing all the information effectively
- Traditionally an issue for the pharmaceutical industry but now an issue for academia too (NIH Roadmap, PubChem)
- High Throughput Screening: can produce biological data points for 100,000 compounds per day
- Combinatorial Chemistry: a single experiment can create thousands of new chemicals
- Microarray Assays: can produce expression information on, e.g., 14,000 genes for a drug-treated tissue sample
- Computational techniques can dock hundreds of thousands of compounds into proteins or calculate millions of properties in a day
- Meta information: go/no-go decisions, series decisions, patents, etc.
22. Observations about the problem
- Existing approaches do not scale up
- Scientists' questions are not that complex, but finding the answers is currently very time consuming and/or complex (for a human)
  - "Has anybody patented this chemical structure I just made?"
  - "Can I get hold of a compound that might bind to the active site of this protein I just resolved?"
  - "Which compounds in this series are least likely to exhibit toxic effects?"
- Answers are often stale after a short period of time: questions need to be re-answered as new information is generated
- Almost all available systems are passive, and follow the (web) browsing model
- There tends to be one interface for every data source (or encompassing just a few)
23. 4 categories, 72 advertisements, 1,000 words × 50 newspapers, accessed in different ways?
24. "However large an array of facts, however rapidly they accumulate, it is possible to keep them in order and to extract from time to time digests containing the most generally significant information, while indicating how to find those items of specialized interest. To do so, however, requires the will and the means ... we need to get the best information in the minimum quantity in the shortest time, from the people who are producing the information to the people who want it, whether they know they want it or not."
J.D. Bernal, quoted in Murray-Rust et al., Org. Biomol. Chem., 2004, 2, 3192-3203
25. The aim
- An open-source prototype that implements a new
model of data mining that would, on request,
push relevant information to pharmaceutical
scientists in response to previously-defined
straightforward expressions of needs, rather than
relying on them stumbling upon the right
information using traditional browsing models.
28. 3-layer model
29. Web Services
- Semantic Web: Next Big Thing
  - Encode semantics in web pages (XML)
  - Describes services as well as information (SOAP, WSDL, UDDI)
  - Computation detached from interface
- Note: seeping through to general web usage
  - http://www.google.com/apis/
  - http://www.amazon.com/webservices
- eScience (UK)
  - £200m over 2001-2006 period
  - http://www.rcuk.ac.uk/escience/
- Cyber Infrastructure / Grid (US)
30. Request from Human Interface
USE-CASE SCRIPT
- Invoke New Structure Service
- Convert structures to 3D
- Dock results protein file
- Extract any hits
- Return links for visualization
AGENT / SMART CLIENT
- Parse request
- Select appropriate use cases and/or web service(s)
- Schedule as necessary
Protocols: UDDI (?), WSDL, SOAP
Atomic services
- Online database (e.g. PubChem)
- Local database
- 3D Docking Tool
- 2D-3D converter
- 3D visualizer
Aggregate services
- New Structure Service
  - Search online databases for recent structures
  - Search local databases for recent structures
  - Merge Results
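The relationship between atomic services, aggregate services and a use-case script can be sketched with plain Python functions. Everything here is a hypothetical stand-in: the in-memory record lists, the function names, the toy docking score and the `viewer://` links are assumptions for illustration, and no real SOAP, WSDL or UDDI calls are made.

```python
# Hypothetical in-memory stand-ins for the online and local databases.
ONLINE_DB = [{"id": "A", "day": 3}, {"id": "B", "day": 9}]
LOCAL_DB = [{"id": "B", "day": 9}, {"id": "C", "day": 8}]

def search_db(records, since_day):
    """Atomic service: return records added on or after `since_day`."""
    return [r for r in records if r["day"] >= since_day]

def convert_to_3d(structure):
    """Atomic service: placeholder for a 2D-3D converter."""
    return dict(structure, has_3d=True)

def dock(structure, protein):
    """Atomic service: placeholder docking tool with a made-up toy score."""
    score = float(len(structure["id"]) + structure["day"])
    return dict(structure, protein=protein, score=score)

def new_structure_service(since_day):
    """Aggregate service: search both databases and merge the results."""
    merged = {}
    for record in search_db(ONLINE_DB, since_day) + search_db(LOCAL_DB, since_day):
        merged[record["id"]] = record  # the same id from two sources is kept once
    return sorted(merged.values(), key=lambda r: r["id"])

def run_use_case(protein, since_day, min_score):
    """Use-case script: chain the services in the order listed on the slide."""
    structures = new_structure_service(since_day)
    docked = [dock(convert_to_3d(s), protein) for s in structures]
    hits = [d for d in docked if d["score"] >= min_score]
    # Return (hypothetical) links that a 3D visualizer could open.
    return ["viewer://" + h["id"] for h in hits]

print(run_use_case("protein.pdb", since_day=5, min_score=10.0))
```

In this sketch the agent's job would be to choose which use-case function to call for a given request; in the architecture on the slide each function would instead be a web service described by WSDL and invoked over SOAP.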
31. Prototype development plan
- Develop a handful of use-cases based around industry/academia scientists
- Build 5-6 data / computation sources (e.g. enumeration, property calculation, structure database) that can fulfill the use cases
- Build WSDL and SOAP web services around the data sources that can be accessed from Taverna
- Develop workflows in Taverna (see taverna.sourceforge.net)
- Publish web services in UDDI
- Encode use-cases into scripts
- Build Intelligent Agent / Smart Client node that can match user needs with scripts & web services using workflows
- Develop browser interface through Contextual Inquiry/Usability Studies
- Consider mapping to a Natural Language Interface
33. Technology
- Perl SOAP::Lite
  - Will be used for initial web service development
  - Doesn't really implement WSDL & UDDI
- Apache Axis & Tomcat
  - Deploy WSDL for web services
- BPEL4WS: Business Process Execution Language
  - For aggregation of web services
  - http://www-128.ibm.com/developerworks/library/specification/ws-bpel/
- Microsoft .NET & C#
34. A 2D structure is supplied for input into the similarity search (in this case, the extracted bound ligand from the PDB 1Y4 complex)
A protein implicated in tumor growth is supplied to the docking program (in this case HSP90 taken from the PDB 1Y4 complex)
Correlation of docking results and biological fingerprints across the human tumor cell lines can help identify potential mechanisms of action of DTP compounds
The workflow employs our local NIH DTP database service to search 200,000 compounds tested in human tumor cellular assays for similar structures to the ligand. Client portlets are used to browse these structures
Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet.
Similar structures are filtered for druggability, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.
35. Search hits are docked into acetylcholinesterase and visualized using JMOL
Hits returned from similarity search on Donepezil in NIH DTP database using web service or SQL link
Information retrieved from PubChem using Expert Query
Expert Query can be used to retrieve related genomic information from the web
36. Funding at Indiana
- $500,000 NIH grant to develop an exploratory Chemical Informatics Cyberinfrastructure Collaboratory over two years. Partnership with Informatics and Community Grids Lab. Developing web services and workflows particularly relating to PubChem and HTS analysis. May lead to $10m funding for full center. See www.chembiogrid.org.
- $49,000 Microsoft Research Smart Clients for eScience grant for development of this system
- For more information, see http://www.informatics.indiana.edu/djwild
37. PubChem
- pubchem.ncbi.nlm.nih.gov
- Currently contains 10,096,336 chemical structures (3/30/06) and growing rapidly
- Likely to be a (or the) major worldwide source of chemical information unless litigation restricts it (CAS)
- Includes the latest InChI representations
- Can be queried on chemical structure, exposed as web services (SOAP)
- Linked in with MLI project and biological data from HTS experiments
38. Supplemental slides
39. Cluster Analysis and Chemical Informatics
- Used for organizing datasets into chemical series, to build predictive models, or to select representative compounds
- Organizational usage has not been as well studied as the other two, but see
  - Wild, D.J., Blankley, C.J. Comparison of 2D Fingerprint Types and Hierarchy Level Selection Methods for Structural Grouping using Ward's Clustering, Journal of Chemical Information and Computer Sciences, 2000, 40, 155-162.
- Essentially helping large datasets become manageable
- Methods used
  - Jarvis-Patrick and variants
    - O(N^2), single partition
  - Ward's method
    - Hierarchical, regarded as best, but at least O(N^2)
  - K-means
    - < O(N^2), requires set no. of clusters, a little messy
  - Sphere-exclusion (Butina)
    - Fast, simple, similar to JP
  - Kohonen network
    - Clusters arranged in 2D grid, ideal for visualization
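As an illustration of the simplest method in the list, here is a minimal sphere-exclusion sketch in the style of Butina's algorithm. It assumes fingerprints are represented as Python sets of "on" bit positions and compared with the Tanimoto coefficient; the toy fingerprints and the 0.5 threshold are made up, and the rule of re-counting unassigned neighbours at every step is a simplification rather than the published procedure.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def sphere_exclusion(fingerprints, threshold):
    """Return a list of (centre_index, member_indices) clusters."""
    n = len(fingerprints)
    # Neighbour lists: everything within the similarity threshold (O(N^2) here).
    neighbors = [
        {j for j in range(n)
         if tanimoto(fingerprints[i], fingerprints[j]) >= threshold}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Pick the compound with the most unassigned neighbours as the centre
        # (ties go to the lowest index), then exclude its whole "sphere".
        center = max(sorted(unassigned),
                     key=lambda i: len(neighbors[i] & unassigned))
        members = neighbors[center] & unassigned
        clusters.append((center, sorted(members)))
        unassigned -= members
    return clusters

# Toy fingerprints: two obvious series and one singleton.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 4, 5}, {10, 11}, {10, 11, 12}, {20}]
print(sphere_exclusion(fps, threshold=0.5))
```

The neighbour-list step is the expensive part; it is written naively here, so this sketch does not reflect the speed of a production implementation.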
40. Limitations of Ward's method for large datasets (>1m)
- Best algorithms have O(N^2) time requirement (RNN)
- Requires random access to fingerprints
  - hence substantial memory requirements (O(N))
- Problem of selection of best partition
  - can select desired number of clusters
- Easily hit 4GB memory addressing limit on 32-bit machines
  - Approximately 2m compounds
41. Scaling up clustering methods
- Parallelisation
  - Clustering algorithms can be adapted for multiple processors
  - Some algorithms more appropriate than others for particular architectures
  - Ward's has been parallelized for shared memory machines, but overhead considerable
- New methods and algorithms
  - Divisive (bisecting) K-means method
    - Hierarchical Divisive
    - Approx. O(N log N)
42. Divisive K-means Clustering
- New hierarchical divisive method
  - Hierarchy built from top down, instead of bottom up
  - Divide complete dataset into two clusters
  - Continue dividing until all items are singletons
  - Each binary division done using K-means method
- Originally proposed for document clustering
  - Bisecting K-means
  - Steinbach, Karypis and Kumar (Univ. Minnesota) http://www-users.cs.umn.edu/karypis/publications/Papers/PDF/doccluster.pdf
  - Found to be more effective than agglomerative methods
  - Forms more uniformly-sized clusters at given level
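The top-down procedure described on this slide can be sketched as follows. This is a toy illustration on 2D points, not the BCI Divkmeans implementation: the deterministic seed choice, the "split the largest cluster next" rule, the squared Euclidean distance and stopping at `k` clusters (rather than continuing down to singletons) are all assumptions made to keep the example short.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def two_means(points, iterations=20):
    """One binary division: K-means with K=2 on `points`."""
    # Assumed seeding: the first point and the point farthest from it.
    seeds = [points[0], max(points, key=lambda p: dist2(p, points[0]))]
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            nearer_first = dist2(p, seeds[0]) <= dist2(p, seeds[1])
            groups[0 if nearer_first else 1].append(p)
        if not groups[0] or not groups[1]:
            break
        new_seeds = [centroid(groups[0]), centroid(groups[1])]
        if new_seeds == seeds:  # centroids stopped moving
            break
        seeds = new_seeds
    return groups

def bisecting_kmeans(points, k):
    """Build clusters top-down by repeatedly splitting the largest cluster."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        target = clusters.pop(0)
        if len(target) < 2:  # nothing left to split
            clusters.append(target)
            break
        left, right = two_means(target)
        if not left or not right:  # degenerate split, e.g. identical points
            left, right = target[:1], target[1:]
        clusters.extend([left, right])
    return clusters

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50), (51, 50)]
for cluster in bisecting_kmeans(points, k=3):
    print(cluster)
```

Recording each split as it happens would give the full hierarchy; the sketch keeps only the current partition.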
43. BCI Divkmeans
- Several options for detailed operation
- Selection of next cluster for division
  - size, variance, diameter
  - affects selection of partitions from hierarchy, not shape of hierarchy
- Options within each K-means division step
  - distance measure
  - choice of seeds
  - batch-mode or continuous update of centroids
  - termination criterion
- Have developed parallel version for Linux clusters / grids in conjunction with BCI
- For more information, see the Barnard and Engels talks at http://cisrg.shef.ac.uk/shef2004/conference.htm
44. Comparative execution times: NCI subsets, 2.2 GHz Intel Celeron processor
[Chart; values shown, labels not recoverable: 7h 27m, 3h 06m, 2h 25m, 44m]
45. Clustering a 1 million compound dataset on a 2.2 GHz Celeron Desktop Machine
Results from AVIDD clusters & Teragrid coming soon.
Time for a single run may vary due to different selection of seeds. Runtimes can be shortened, e.g. by using a max. number of iterations or a relocation cutoff.
46. Divisive K-means: Conclusions
- Much faster than Ward's, speed comparable to K-means, suitable for very large datasets (millions)
- Time requirements approximately O(N log N)
- Current implementation can cluster 1m compounds in under a week on a low-power desktop PC
- Cluster 1m compounds in a few hours with a 4-node parallel Linux cluster
- Better balance of cluster sizes than Ward's or K-means
- Visual inspection of clusters suggests better assembly of compound series than other methods
- Better clustering of actives together than previously-studied methods
- Memory requirements minimal
- Experiments using AVIDD cluster and Teragrid forthcoming (50 nodes)
47. Visualization & interface level tools
- No matter how clever the smarts underneath, the overriding factor in usefulness will be the quality of scientists' interaction with the system
- Contextual Design, Interaction Design (Cooper) and Usability Studies have proven effective in designing the right interfaces for the right people in chemical informatics: collaboration with HCI?
- Possibility of multiple interfaces for different people groups (Cooper's primary personas)
- Don't assume the browser interface: email / NLP?
- Start with the basics
  - 2D chemical structure drawing (input)
  - Visualization of large numbers of chemical structures in 2D
  - 3D chemical structure visualization
- Planning on evaluation of NLP, email, RSS, etc. as well as browser-based interfaces
48. Usability of 2D structure drawing tools
- Key difference between sequential and random drawers
- Huge difference in intuitiveness
- Key factor: how badly you can mess things up
- Marvin Sketch & JME > ChemDraw >> ISIS Draw
49. Visualization methods for datasets & clusters
- Partitions
  - Spreadsheets
  - Enhanced Spreadsheets
  - 2D or 3D plots
- Hierarchies
  - Dendrograms
  - Tree Maps
  - Hyperbolic Maps
52. VisualiSAR: with a nod to Edward Tufte. See http://www.daylight.com/meetings/mug99/Wild/Mug99.html
53. Tree Maps: very Tufte-esque
54. 3D Visualization - JMOL
- Open Source, very flexible, works in a web service environment: jmol.sourceforge.net
55. Sentient - an alternative approach to managing heterogeneous data sources
- Collaboration with IO-Informatics (along with Cornell and UCSD) for the investigation of service-oriented architectures in life sciences research using Sentient software
- Aim to integrate several sources of information relating to Alzheimer's Disease (brain imaging, morphology, gene expression) so that cross-dataset biomarkers can be identified
- Sentient uses Intelligent Multidimensional Objects (IMOs) to define and query data sources and the tools used to access them
- Still a browsing approach, but with a layer of coherence and intelligence
- Hope to expand to include chemistry data
- Can also be used as an interface-level tool