Smart mining of drug discovery information using web service workflows

Slides: 58

Provided by: david440

Transcript and Presenter's Notes

1
Smart mining of drug discovery information using
web service workflows

David Wild
Assistant Professor of Chemical Informatics
Indiana University School of Informatics,
Bloomington
djwild _at_ indiana.edu
March 2006

2
About Me

B.Sc. in Computing Science
Ph.D. 1994 in Willett group at University of
Sheffield, UK: GAs and parallel processors for
3D field similarity searching
Postdocs at Sheffield and Parke-Davis (Ann Arbor)
Senior Scientist at Parke-Davis / Pfizer
Scientific Computing R&D group
2002: Started scientific computing company,
adjunct professorship in Pharm. Eng. at Michigan
2004: Joined SOI in part-time visiting position

3
Overview

Chemical Informatics: what and why
Challenges of diverse sources and large volumes
of information
Our research into using web services,
workflows and smart agents
Careers in chemical informatics

4
Chemical informatics is

More usually known as chemoinformatics or
cheminformatics
Very differently defined, reflecting its
cross-disciplinary nature
Librarian
Chemist (synthetic, medicinal, theoretical)
Biologist / Bioinformatician
Molecular modeler
Pharmaceutical or Chemical Engineer
Computer Scientist / Informatician

5
My definition of Chemical Informatics

Chemical Informatics (a.k.a. chemoinformatics) is
the branch of informatics dealing with all
aspects of the representation and use of chemical
structures, and related information, on computer.
It is an interdisciplinary field that
regularly pushes the boundaries of computer
science, statistics, visualization methods,
computing power and scientific technique. The
subject covers a wide variety of applications and
specialties, particularly in the pharmaceutical
industry, where the rapid increase in new
technologies in drug discovery puts chemical
informatics at the forefront of drug design. It
is foundational to such diverse applications as
3D molecular modeling, artificial intelligence
biological activity prediction methods, patent
and chemical database searching, and high
throughput screening data analysis.

6
More definitions

Computational Chemistry: The application of
mathematical and computational methods,
particularly to theoretical chemistry
Molecular Modeling: Using 3D graphics and
optimization techniques to help understand the
nature and action of compounds and proteins
Computer-Aided Drug Design: The discipline of
using computational techniques (including
chemical informatics) to assist in the discovery
and design of drugs.

7
Chemoinformatics hits on Google
Dec 2005 348,100
April 2005 125,600
July 2000 723

Number of word occurrences on Google, taken from
http://www.molinspiration.com/chemoinformatics.html
8
Virtual screening: predicting drug activity
Virtual chemistry: learning about the way
compounds work
Analyzing and navigating large volumes of
chemical & biological information
9
Example 1: High-Throughput Screening
Testing perhaps millions of compounds in a
corporate collection to see if any show activity
against a certain disease protein
10
High-Throughput Screening

Traditionally, small numbers of compounds were
tested for a particular project or therapeutic
area
About 10 years ago, technology developed that
enabled large numbers of compounds to be assayed
quickly
High-throughput screening can now test 100,000
compounds a day for activity against a protein
target
Maybe tens of thousands of these compounds will
show some activity for the protein
The chemist needs to intelligently select the 2-3
classes of compounds that show the most promise
as drugs to follow up

11
Informatics Implications

Need to be able to store chemical structure and
biological data for millions of data points
Computational representation of 2D structure
Need to be able to organize thousands of active
compounds into meaningful groups
Group similar structures together and relate to
activity (cluster analysis)
Need to learn as much information as
possible (data mining)
Apply statistical methods to the structures and
related information
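
As a hedged illustration of "group similar structures together": a minimal sketch, assuming each 2D structure has already been reduced to a fingerprint stored as a set of "on" bit positions, and assuming the Tanimoto coefficient as the similarity measure. The slides do not name a measure; the function names and toy fingerprints are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints (sets of on-bit indices)."""
    shared = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - shared
    # Two empty fingerprints are treated as identical (an assumption).
    return shared / union if union else 1.0


def nearest_neighbours(query, library, threshold=0.5):
    """Return (name, similarity) pairs at or above threshold, most similar first."""
    scored = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    hits = [pair for pair in scored if pair[1] >= threshold]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Toy data: 2 shared bits out of 4 distinct bits gives 0.5.
example = tanimoto({1, 2, 3}, {2, 3, 4})
```

Compounds whose fingerprints score above the threshold against each other are candidates for the same structural group, which can then be related to activity.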

12
Tools for mining the data
Tripos Benchware HTS Dataminer (formerly SAR
Navigator), www.tripos.com
13
Example 2: 3D Visualization & Docking

3D Visualization of interactions between
compounds and proteins
Docking compounds into proteins
computationally

14
3D Visualization

X-ray crystallography and NMR Spectroscopy can
reveal 3D structure of protein and bound
compounds
Visualization of these complexes of proteins
and potential drugs can help scientists
understand the mechanism of action of the drug
and improve its design
Visualization uses computational ball and stick
model of atoms and bonds, as well as surfaces
Stereoscopic visualization available

15
Visualization Demos (JMOL)

jmol.sourceforge.net
www-mslmb.niddk.nih.gov/prag/structures.html

16
Docking algorithms

Require 3D atomic structure for protein, and 3D
structure for compound (ligand)
May require initial rough positioning for the
ligand
Will use an optimization method to try and find
the best rotation and translation of the ligand
in the protein, for optimal binding affinity

17
Genetic Algorithms

Create a population of possible solutions,
encoded as chromosomes
Use fitness function to score solutions
Good solutions are combined together
(crossover) and altered (mutation) to provide
new solutions
The process repeats until the population
converges on a solution
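
The loop above can be sketched as follows. This is a minimal illustration, not the algorithm of any particular docking program: the chromosome is assumed to be a list of floats (for docking it would encode rotation and translation of the ligand), the fitness function is a toy stand-in for a docking score, and a fixed number of generations replaces a real convergence test. All names and parameter values are hypothetical.

```python
import random


def run_ga(fitness, n_genes, pop_size=40, generations=60,
           mutation_rate=0.2, seed=0):
    """Evolve a population of real-valued chromosomes and return the best one."""
    rng = random.Random(seed)
    population = [[rng.uniform(-5, 5) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Score solutions and keep the fitter half as parents.
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            # Crossover: take the head of one parent and the tail of the other.
            cut = rng.randrange(1, n_genes) if n_genes > 1 else 0
            child = mum[:cut] + dad[cut:]
            # Mutation: occasionally nudge a gene with Gaussian noise.
            child = [g + rng.gauss(0, 0.3) if rng.random() < mutation_rate else g
                     for g in child]
            children.append(child)
        population = parents + children
    return max(population, key=fitness)


# Toy fitness: closeness to a hidden target vector (higher is better).
TARGET = [1.0, -2.0, 0.5]


def toy_fitness(chromosome):
    return -sum((g - t) ** 2 for g, t in zip(chromosome, TARGET))


best = run_ga(toy_fitness, n_genes=3)
```

Because the parents survive unchanged into the next generation, the best score never gets worse from one generation to the next.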

18
Sample GOLD output

GMP into RNaseT1

19
Something fun

Screensaver that docks molecules while your
computer is idle at
http://www.grid.org/projects/cancer/

20
Chemical Informatics Tools

Databases: chemical structure, biological
activity, properties, genomic
Computation: docking, pharmacophore generation,
property calculation, energy minimization,
cluster analysis, format conversion, alignment,
2D & 3D predictive models (QSAR)
Visualization & analysis: plotting, spreadsheet
views, nonlinear maps, Kohonen maps, 3D molecular
visualization, structure entry
Mix of sources: commercial vendors, open source,
academic code
Very complex to use together

21
Vast increase in quantity of information and
number of data sources

Until the last few years, the main challenges in
chemical informatics were about quality, but now
it's about quality, quantity and managing all the
information effectively
Traditionally an issue for the pharmaceutical
industry but now an issue for academia too (NIH
Roadmap, PubChem)
High-Throughput Screening: can produce
biological data points for 100,000 compounds per
day
Combinatorial Chemistry: a single experiment can
create thousands of new chemicals
Microarray Assays: can produce expression
information on, e.g., 14,000 genes for a
drug-treated tissue sample
Computation: techniques can dock hundreds of
thousands of compounds into proteins or calculate
millions of properties in a day
Meta-information: go/no-go decisions, series
decisions, patents, etc.

22
Observations about the problem

Existing approaches do not scale up
Scientists' questions are not that complex, but
finding the answers is currently very
time-consuming and/or complex (for a human)
Has anybody patented this chemical structure I
just made?
Can I get hold of a compound that might bind to
the active site of this protein I just resolved?
Which compounds in this series are least likely
to exhibit toxic effects?
Answers are often stale after a short period of
time: questions need to be re-answered as new
information is generated
Almost all available systems are passive, and
follow the (web) browsing model
There tends to be one interface for every data
source (or encompassing just a few)

23
4 categories 72 advertisements 1,000 words X
50 newspapers accessed in different ways?
24
"However large an array of facts, however rapidly
they accumulate, it is possible to keep them in
order and to extract from time to time digests
containing the most generally significant
information, while indicating how to find those
items of specialized interest. To do so, however,
requires the will and the means ... we need to
get the best information in the minimum
quantity in the shortest time, from the people
who are producing the information to the people
who want it, whether they know they want it or
not." J.D. Bernal, quoted in Murray-Rust et al.,
Org. Biomol. Chem., 2004, 2, 3192-3203
25
The aim

An open-source prototype that implements a new
model of data mining that would, on request,
push relevant information to pharmaceutical
scientists in response to previously-defined
straightforward expressions of needs, rather than
relying on them stumbling upon the right
information using traditional browsing models.

26
(No Transcript)
27
(No Transcript)
28
3-layer model
29
Web Services

Semantic Web: Next Big Thing
Encode semantics in web pages (XML)
Describes services as well as information (SOAP,
WSDL, UDDI)
Computation detached from interface
Note: seeping through to general web usage
http://www.google.com/apis/
http://www.amazon.com/webservices
eScience (UK)
£200m over 2001-2006 period
http://www.rcuk.ac.uk/escience/
Cyber Infrastructure / Grid (US)

30
Request from Human Interface
AGENT / SMART CLIENT
Parse request
Select appropriate use cases and/or web service(s)
Schedule as necessary
USE-CASE SCRIPT
Invoke New Structure Service
Convert structures to 3D
Dock results into protein file
Extract any hits
Return links for visualization
UDDI (?)
WSDL
SOAP
Atomic services
Online database (e.g. PubChem)
Local database
3D Docking Tool
2D-3D converter
3D visualizer
Aggregate services
New Structure Service
Search online databases for recent structures
Search local databases for recent structures
Merge results
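
The layering on this slide can be sketched in code. This is only an illustration under stated assumptions: every function below is a hypothetical local stand-in for what would be a remote SOAP web service, the records and scores are made-up placeholders, and the link format is invented. It shows atomic services, one aggregate service that merges two of them, and a use-case script that chains the steps.

```python
# Atomic services (hypothetical stand-ins for remote web service calls).
def search_online_database(since_days):
    return [{"id": "A1", "structure": "CCO"}, {"id": "A2", "structure": "CCN"}]


def search_local_database(since_days):
    return [{"id": "A2", "structure": "CCN"}, {"id": "L7", "structure": "c1ccccc1"}]


def convert_to_3d(record):
    # A real 2D-3D converter would generate coordinates here.
    return dict(record, coords="<3D coordinates placeholder>")


def dock(record, protein_file):
    # Placeholder score only; a real docking tool would compute this.
    return dict(record, protein=protein_file, score=float(len(record["structure"])))


# Aggregate service: search both sources and merge, dropping duplicate ids.
def new_structure_service(since_days):
    merged = {}
    for rec in search_online_database(since_days) + search_local_database(since_days):
        merged.setdefault(rec["id"], rec)
    return list(merged.values())


# Use-case script: the sequence an agent / smart client would select and run.
def new_structure_docking_use_case(protein_file, since_days=7, score_cutoff=5.0):
    structures = new_structure_service(since_days)
    docked = [dock(convert_to_3d(s), protein_file) for s in structures]
    hits = [d for d in docked if d["score"] >= score_cutoff]
    return [f"view?id={h['id']}&protein={protein_file}" for h in hits]


links = new_structure_docking_use_case("protein.pdb")
```

Keeping each step behind its own function mirrors the slide's point that computation is detached from the interface: any step could be swapped for a different service without changing the script.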
31
Prototype development plan

Develop a handful of use-cases based around
industry/academia scientists
Build 5-6 data / computation sources (e.g.
enumeration, property calculation, structure
database) that can fulfill the use cases
Build WSDL and SOAP web services around the data
sources that can be accessed from Taverna
Develop workflows in Taverna (see
taverna.sourceforge.net)
Publish web services in UDDI
Encode use-cases into scripts
Build Intelligent Agent / Smart Client node that
can match user needs with scripts & web services
using workflows
Develop browser interface through Contextual
Inquiry/Usability Studies
Consider mapping to a Natural Language Interface

32
(No Transcript)
33
Technology

Perl SOAP::Lite
Will be used for initial web service development
Doesn't really implement WSDL & UDDI
Apache Axis & Tomcat
Deploy WSDL for web services
BPEL4WS: Business Process Execution Language
For aggregation of web services
http://www-128.ibm.com/developerworks/library/specification/ws-bpel/
Microsoft .NET & C#

34
A 2D structure is supplied for input into the
similarity search (in this case, the extracted
bound ligand from the PDB 1Y4 complex)
A protein implicated in tumor growth is supplied
to the docking program (in this case HSP90 taken
from the PDB 1Y4 complex)
Correlation of docking results and biological
fingerprints across the human tumor cell lines
can help identify potential mechanisms of action
of DTP compounds
The workflow employs our local NIH DTP database
service to search 200,000 compounds tested in
human tumor cellular assays for similar
structures to the ligand. Client portlets are
used to browse these structures
Once docking is complete, the user visualizes the
high-scoring docked structures in a portlet using
the JMOL applet.
Similar structures are filtered for drugability,
and are automatically passed to the OpenEye FRED
docking program for docking into the target
protein.
35
Search hits are docked into acetylcholinesterase
and visualized using JMOL
Hits returned from similarity search on Donepezil
in NIH DTP database using web service or SQL link
Information retrieved from PubChem using Expert
Query
Expert Query can be used to retrieve related
genomic information from the web
36
Funding at Indiana

$500,000 NIH grant to develop an exploratory
Chemical Informatics Cyberinfrastructure
Collaboratory over two years. Partnership with
Informatics and Community Grids Lab. Developing
web services and workflows particularly relating
to PubChem and HTS analysis. May lead to $10m
funding for full center. See www.chembiogrid.org.
$49,000 Microsoft Research Smart Clients for
eScience grant for development of this system
For more information, see
http://www.informatics.indiana.edu/djwild

37
PubChem

pubchem.ncbi.nlm.nih.gov
Currently contains 10,096,336 chemical structures
(3/30/06) and growing rapidly
Likely to be a (or the) major worldwide source of
chemical information unless litigation
restricts it (CAS)
Includes the latest InChI representations
Can be queried on chemical structure, exposed as
web services (SOAP)
Linked in with MLI project and biological data
from HTS experiments

38
Supplemental slides
39
Cluster Analysis and Chemical Informatics

Used for organizing datasets into chemical
series, to build predictive models, or to select
representative compounds
Organizational usage has not been as well studied
as the other two, but see
Wild, D.J., Blankley, C.J. Comparison of 2D
Fingerprint Types and Hierarchy Level Selection
Methods for Structural Grouping using Ward's
Clustering, Journal of Chemical Information and
Computer Sciences, 2000, 40, 155-162.
Essentially helping large datasets become
manageable
Methods used
Jarvis-Patrick and variants
O(N^2), single partition
Ward's method
Hierarchical, regarded as best, but at least
O(N^2)
K-means
< O(N^2), requires set no. of clusters, a little
messy
messy
Sphere-exclusion (Butina)
Fast, simple, similar to JP
Kohonen network
Clusters arranged in 2D grid, ideal for
visualization
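
As one concrete instance of the methods listed, here is a minimal sketch of sphere-exclusion clustering in the style usually attributed to Butina. It assumes fingerprints stored as sets of on-bit indices and Tanimoto similarity; the threshold and the toy data are hypothetical. Note that this simple version still builds neighbour lists with a full pairwise pass.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two sets of on-bit indices."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    return shared / union if union else 1.0


def sphere_exclusion(fingerprints, threshold=0.5):
    """Return clusters as lists of indices, each led by its centre."""
    n = len(fingerprints)
    # Neighbour lists: every pair at or above the similarity threshold.
    neighbours = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(fingerprints[i], fingerprints[j]) >= threshold:
                neighbours[i].add(j)
                neighbours[j].add(i)
    # Visit candidates with the most neighbours first.
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    assigned = set()
    clusters = []
    for centre in order:
        if centre in assigned:
            continue
        # The centre claims all of its still-unassigned neighbours.
        members = [centre] + sorted(neighbours[centre] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


toy = [{1, 2, 3, 4}, {1, 2, 3, 5}, {1, 2, 3, 6}, {10, 11}, {10, 11, 12}]
toy_clusters = sphere_exclusion(toy)
```

Compounds with no neighbours above the threshold end up as singleton clusters.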

40
Limitations of Ward's method for large datasets
(>1m)

Best algorithms have O(N^2) time requirement (RNN)
Requires random access to fingerprints
hence substantial memory requirements (O(N))
Problem of selection of best partition
can select desired number of clusters
Easily hit 4GB memory addressing limit on 32 bit
machines
Approximately 2m compounds
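
A back-of-envelope check of the last two lines, under an idealised assumption that the whole 32-bit address space is spent on fingerprints. The slides do not state a fingerprint size; the figure of roughly 2 KB per compound below is simply what makes the two numbers agree.

```python
ADDRESS_LIMIT_BYTES = 2 ** 32  # 4 GB addressable on a 32-bit machine


def bytes_per_compound(n_compounds):
    """Memory budget per compound if n_compounds must all fit in 4 GB."""
    return ADDRESS_LIMIT_BYTES / n_compounds


def max_compounds(bytes_per_fingerprint):
    """How many fingerprints of a given size fit in 4 GB."""
    return ADDRESS_LIMIT_BYTES // bytes_per_fingerprint


budget = bytes_per_compound(2_000_000)   # about 2,147 bytes each
capacity = max_compounds(2048)           # 2,097,152 fingerprints of 2 KB
```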

41
Scaling up clustering methods

Parallelisation
Clustering algorithms can be adapted for multiple
processors
Some algorithms more appropriate than others for
particular architectures
Ward's has been parallelized for shared memory
machines, but overhead considerable
New methods and algorithms
Divisive (bisecting) K-means method
Hierarchical Divisive
Approx. O(N log N)

42
Divisive K-means Clustering

New hierarchical divisive method
Hierarchy built from top down, instead of bottom
up
Divide complete dataset into two clusters
Continue dividing until all items are singletons
Each binary division done using K-means method
Originally proposed for document clustering
Bisecting K-means
Steinbach, Karypis and Kumar (Univ. Minnesota)
http://www-users.cs.umn.edu/~karypis/publications/Papers/PDF/doccluster.pdf
Found to be more effective than agglomerative
methods
Forms more uniformly-sized clusters at given
level
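
The top-down procedure on this slide can be sketched as below. This is a minimal illustration and not the BCI implementation: points are assumed to be numeric tuples compared by Euclidean distance (real use would work on fingerprints), the largest cluster is always chosen for the next split, and all names and defaults are hypothetical. Asking for as many clusters as there are points divides all the way down to singletons.

```python
import random


def _centroid(points):
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))


def _sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def two_means(points, rng, max_iter=20):
    """One binary division: split points into two groups with a basic 2-means loop."""
    centres = rng.sample(points, 2)
    half = len(points) // 2
    best = (points[:half], points[half:])  # fallback if a group ever empties
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            nearer = 0 if _sqdist(p, centres[0]) <= _sqdist(p, centres[1]) else 1
            groups[nearer].append(p)
        if not groups[0] or not groups[1]:
            break
        best = groups
        new_centres = [_centroid(g) for g in groups]
        if new_centres == centres:
            break  # assignments have stopped changing
        centres = new_centres
    return best


def bisecting_kmeans(points, n_clusters, seed=0):
    """Repeatedly split the largest cluster until n_clusters exist."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < n_clusters:
        clusters.sort(key=len)
        biggest = clusters.pop()
        if len(biggest) < 2:
            clusters.append(biggest)
            break  # only singletons remain
        clusters.extend(two_means(biggest, rng))
    return clusters


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
two_clusters = bisecting_kmeans(pts, 2)
```

Recording the order of the splits, rather than only the final clusters, would give the hierarchy itself.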

43
BCI Divkmeans

Several options for detailed operation
Selection of next cluster for division
size, variance, diameter
affects selection of partitions from hierarchy,
not shape of hierarchy
Options within each K-means division step
distance measure
choice of seeds
batch-mode or continuous update of centroids
termination criterion
Have developed parallel version for Linux
clusters / grids in conjunction with BCI
For more information, see Barnard and Engels
talks at http://cisrg.shef.ac.uk/shef2004/conference.htm

44
Comparative execution times: NCI subsets, 2.2 GHz
Intel Celeron processor
7h 27m
3h 06m
2h 25m
44m
45
Clustering a 1 million compound dataset on a 2.2
GHz Celeron Desktop Machine
Results from AVIDD clusters & Teragrid coming
soon.
Time for a single run may vary due to
different selection of seeds. Runtimes can be
shortened e.g. by using a max. number of
iterations or a relocation cutoff.
46
Divisive K-means: Conclusions

Much faster than Ward's, speed comparable to
K-means, suitable for very large datasets
(millions)
Time requirements approximately O(N log N)
Current implementation can cluster 1m compounds
in under a week on a low-power desktop PC
Cluster 1m compounds in a few hours with a 4-node
parallel Linux cluster
Better balance of cluster sizes than Ward's or
K-means
Visual inspection of clusters suggests better
assembly of compound series than other methods
Better clustering of actives together than
previously-studied methods
Memory requirements minimal
Experiments using AVIDD cluster and Teragrid
forthcoming (50 nodes)

47
Visualization & interface level tools

No matter how clever the smarts underneath, the
overriding factor in usefulness will be the
quality of scientists' interaction with the
system
Contextual Design, Interaction Design (Cooper)
and Usability Studies have proven effective in
designing the right interfaces for the right
people in chemical informatics: collaboration
with HCI?
Possibility of multiple interfaces for different
people groups (Cooper's primary personas)
Don't assume the browser interface: email / NLP?
Start with the basics
2D chemical structure drawing (input)
Visualization of large numbers of chemical
structures in 2D
3D chemical structure visualization
Planning on evaluation of NLP, email, RSS, etc.
as well as browser-based interfaces

48
Usability of 2D structure drawing tools

Key difference between sequential and random
drawers
Huge difference in intuitiveness
Key factor: how badly you can mess things up
Marvin Sketch, JME > ChemDraw >> ISIS Draw

49
Visualization methods for datasets & clusters

Partitions
Spreadsheets
Enhanced Spreadsheets
2D or 3D plots
Hierarchies
Dendrograms
Tree Maps
Hyperbolic Maps

50
(No Transcript)
51
(No Transcript)
52
VisualiSAR: with a nod to Edward Tufte. See
http://www.daylight.com/meetings/mug99/Wild/Mug99.html
53
Tree Maps: very Tufte-esque
54
3D Visualization - JMOL

Open Source, very flexible, works in a web
service environment jmol.sourceforge.net

55
Sentient - an alternative approach to managing
heterogeneous data sources

Collaboration with IO-Informatics (along with
Cornell, and UCSD) for the investigation of
service-oriented architectures in life sciences
research using Sentient software
Aim to integrate several sources of information
relating to Alzheimer's Disease (brain imaging,
morphology, gene expression) so that
cross-dataset biomarkers can be identified
Sentient uses Intelligent Multidimensional
Objects (IMOs) to define and query data sources
and the tools used to access them
Still a browsing approach, but with a layer of
coherence and intelligence
Hope to expand to include chemistry data
Can also be used as an interface-level tool

56
(No Transcript)
57
(No Transcript)
