EChemistry and Web 2.0 - PowerPoint PPT Presentation

1
E-Chemistry and Web 2.0
  • Marlon Pierce
  • mpierce_at_cs.indiana.edu
  • Community Grids Lab
  • Indiana University

2
One Talk, Two Projects
  • NIH funded Chemical Informatics and
    Cyberinfrastructure Collaboratory.
  • Geoffrey Fox, Gary Wiggins, Rajarshi Guha, David
    Wild, Mookie Baik, Kevin Gilbert
  • Proposed Microsoft-Funded Project
  • Carl Lagoze (Cornell), Lee Giles (PSU), Steve
    Bryant (NIH), Jeremy Frey (Soton), Peter
    Murray-Rust (Cambridge), Herbert Van de Sompel
    (Los Alamos), Geoffrey Fox (Indiana)

3
CICC Project Summary
  • Creating a comprehensive, easily accessible
    infrastructure for chemoinformatics tools and
    data sources, linked with NIH PubChem and made
    available as web services, and partnering with
    screening centers and other users to demonstrate
    how this infrastructure can be usefully applied
  • Infrastructure can include any tools, not just
    ours (commercial/open source, chemoinformatics,
    bioinformatics, and so on)
  • New, custom applications can be built quickly
    using existing services in a similar way to
    Google Maps and other web 2.0 resources
  • Develop education program to foster chemical
    informatics as an academic discipline.
  • Field is dominated by big pharma
  • Need to train students, provide open research
    environment.

4
Chemical Informatics and Cyberinfrastructure
Collaboratory (CICC)
Funded by the National Institutes of Health
www.chembiogrid.org
CICC Combines Grid Computing with Chemical
Informatics
Large Scale Computing Challenges
Science and Cyberinfrastructure
CICC is an NIH funded project to support chemical
informatics needs of High Throughput Cancer
Screening Centers. The NIH is creating a data
deluge of publicly available data on potential
new drugs.
Chemical Informatics is a non-traditional area of
high performance computing, but many new,
challenging problems may be investigated.
NIH PubMed Database
OSCAR Text Analysis
Toxicity Filtering
Cluster Grouping
Docking
Initial 3D Structure Calculation
OSCAR-mined molecular signatures can be
clustered, filtered for toxicity, and docked onto
larger proteins. These are classic pleasingly
parallel tasks. Top-ranking docked molecules
can be further examined for drug potential.
Chemical informatics text analysis programs can
process 100,000s of abstracts of online
journal articles to extract chemical signatures
of potential drugs.
Molecular Mechanics Calculations
Big Red (and the TeraGrid) will also enable us to
perform time consuming, multi-stepped Quantum
Chemistry calculations on all of PubMed. Results
go back to public databases that are freely
accessible by the scientific community.
  • CICC supports the NIH mission by combining state
    of the art chemical informatics techniques with
  • World class high performance computing
  • National-scale computing resources (TeraGrid)
  • Internet-standard web services
  • International activities for service
    orchestration
  • Open distributed computing infrastructure for
    scientists world wide

NIH PubChem Database
Quantum Mechanics Calculations
IU's Varuna Database
POVRay Parallel Rendering
Indiana University Department of Chemistry,
School of Informatics, and Pervasive Technology
Laboratories
5
CICC Infrastructure Vision
  • Drug discovery and other academic chemistry and
    pharmacology research will be aided by powerful
    modern information technology. CICC is set up as a
    distributed cyberinfrastructure in the eScience
    model.
  • Web clients (user interfaces) to distributed
    databases, results of high throughput screening
    instruments, results of computational chemical
    simulations and other analyses.
  • Aggregated into portals
  • Web services manipulate this data and are
    combined into workflows.
  • CICC includes access to PubChem, PubMed, PubMed
    Central, the Internet and its derivatives like
    Microsoft Academic Live and Google Scholar
  • The services include open-source software like
    CDK, commercial code from vendors such as BCI,
    OpenEye, Gaussian and Google, and any user
    contributed programs

6
Services and Sample Clients
  • Our SOA philosophy: use standard Web services.
  • Mostly stateless
  • Some cluster, HPC work
  • Services are aggregate-able into different
    workflows.
  • Taverna, Pipeline Pilot, etc.
  • You can also build lots of clients.
  • See http://www.chembiogrid.org/wiki/index.php/CICC_Web_Resources
    for links and details.
  • Not so far from Web 2.0.

7
Sample Services
8
More Services
9
More Services.
10
More Services
11
More Services (Last)
12
Web Client Interfaces
13
More Clients
14
More Clients
15
Web Service Locations
  • Cambridge University: InChI generation / search,
    CMLRSS, OpenBabel
  • Indiana University: Clustering, VOTables, OSCAR3,
    Toxicity classification, Database services,
    Statistics services
  • VCC Laboratory: ALogPS
  • NCI: CSLS
  • University of Cologne: NMRShiftDB

16
Where Does The Functionality Come From?
  • University of Michigan: PkCell
  • Cambridge University: InChI generation / search,
    OSCAR
  • DigitalChemistry: BCI fingerprints, DivKMeans
  • gNova Consulting
  • NIH: PubChem, PubMed
  • CDK: Cheminformatics
  • European Chemicals Bureau: ToxTree toxicity
    predictions
  • OpenEye: Docking
  • R Foundation: R package
  • Indiana University: VOTables, NCI DTP predictions,
    Database services

17
MLSCN Post-HTS Biology Decision Support
Percent Inhibition or IC50 data is retrieved from
HTS.
Grids can link data analysis (e.g. image
processing developed in existing Grids),
traditional cheminformatics tools, and
annotation tools (Semantic Web, del.icio.us) to
enhance lead ID and SAR analysis. A Grid of Grids
links collections of services at PubChem, ECCR
centers, and MLSCN centers.
  • Question: Was this screen successful?
  • Workflows encoding plate control well
    statistics, distribution analysis, etc.
  • Question: What should the active/inactive cutoffs
    be?
  • Workflows encoding distribution analysis of
    screening results
  • Question: What can we learn about the target
    protein or cell line from this screen?
  • Workflows encoding statistical comparison of
    results to similar screens, docking of compounds
    into proteins to correlate binding with
    activity, literature search of active compounds,
    etc.
Compounds submitted to PubChem
Diagram labels: CHEMINFORMATICS, PROCESS, GRIDS
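As an illustration of a plate control well statistic, the sketch below computes the Z'-factor, a widely used measure of how well positive and negative controls are separated on a plate. The slide does not say which statistics the CICC workflows actually encode, so the choice of Z', the function name, and the control values are assumptions made for this example.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor: 1 - 3 * (sd_pos + sd_neg) / |mean_pos - mean_neg|.

    Uses the sample standard deviation of each control group.
    """
    mu_p, mu_n = mean(positive_controls), mean(negative_controls)
    sd_p, sd_n = stdev(positive_controls), stdev(negative_controls)
    separation = abs(mu_p - mu_n)
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    return 1.0 - 3.0 * (sd_p + sd_n) / separation


# Made-up percent-inhibition readings for one plate's control wells.
plate_quality = z_prime([98.0, 100.0, 102.0], [0.0, 2.0, 4.0])
```

Values close to 1 indicate well-separated controls; a screen-level decision would normally look at this statistic across all plates rather than a single one.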
18
Example: HTS workflow for finding cell-protein
relationships
  • A protein implicated in tumor growth with a known
    ligand is selected (in this case HSP90 taken from
    the PDB 1Y4 complex).
  • The screening data from a cellular HTS assay is
    similarity searched for compounds with similar 2D
    structures to the ligand.
  • Structures similar to the ligand can be browsed
    using client portlets.
  • Similar structures are filtered for druggability,
    converted to 3D, and automatically passed
    to the OpenEye FRED docking program for docking
    into the target protein.
  • Once docking is complete, the user visualizes the
    high-scoring docked structures in a portlet using
    the Jmol applet.
  • Docking results and activity patterns are fed into
    R services for building activity models and
    correlations (least-squares regression, random
    forests, neural nets).
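The 2D similarity search step can be sketched as follows. This is a minimal illustration that assumes structures are already encoded as fingerprints (sets of "on" bit positions) and that similarity is measured with the Tanimoto coefficient; the slides do not name the metric or the fingerprint used here, and the compound names and bit sets are invented.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(query_fp, library, threshold=0.7):
    """Return (name, score) pairs at or above the threshold, best first."""
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy fingerprints standing in for the known ligand and the screened compounds.
ligand = frozenset({1, 4, 7, 9, 12})
library = {
    "cmpd_A": frozenset({1, 4, 7, 9, 12}),
    "cmpd_B": frozenset({1, 4, 7, 9}),
    "cmpd_C": frozenset({2, 3, 5}),
}
hits = similarity_search(ligand, library, threshold=0.7)
```

In the workflow described above, the hits would then be filtered, converted to 3D, and handed to the docking service.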
19
Example: PubDock
  • Database of approximately 1 million PubChem
    structures (the most drug-like) docked into
    proteins taken from the PDB
  • Available as a web service, so structures can be
    accessed in your own programs, or using workflow
    tools like Pipeline Pilot
  • Several interfaces developed, including one based
    on Chimera (right) which integrates the database
    with the PDB to allow browsing of compounds in
    different targets, or different compounds in the
    same target
  • Can be used as a tool to help understand
    molecular basis of activity in cellular or image
    based assays

20
Example: R Statistics applied to PubChem data
  • By exposing the R statistical package, and the
    Chemistry Development Kit (CDK) toolkit as web
    services and integrating them with PubChem, we
    can quickly and easily perform statistical
    analysis and virtual screening of PubChem assay
    data
  • Predictive models for particular screens are
    exposed as web services, and can be used either
    as simple web tools or integrated into other
    applications
  • Example uses DTP Tumor Cell Line screens - a
    predictive model using Random Forests in R makes
    predictions of probability of activity across
    multiple cell lines.

21
RSS Feeds
  • Provide access to databases via RSS feeds
  • Feeds include 2D/3D structures in CML
  • Viewable in Bioclipse, Jmol as well as Sage etc.
  • Two feeds currently available
  • SynSearch: get structures based on full or
    partial chemical names
  • DockSearch: get best N structures for a target
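A client for such a feed could pull the embedded CML out of each item roughly as sketched below. The sample feed is invented for illustration: it assumes an RSS 1.0 (RDF) document with `cml:molecule` elements inside each item, and the actual layout of the SynSearch and DockSearch feeds may differ.

```python
import xml.etree.ElementTree as ET

# Namespace prefixes used in the XPath-style queries below.
NS = {
    "rss": "http://purl.org/rss/1.0/",
    "cml": "http://www.xml-cml.org/schema",
}

# Hypothetical one-item feed; not copied from a real CICC feed.
SAMPLE_FEED = """
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns="http://purl.org/rss/1.0/"
         xmlns:cml="http://www.xml-cml.org/schema">
  <item rdf:about="http://example.org/structure/1">
    <title>water</title>
    <cml:molecule id="m1">
      <cml:atomArray>
        <cml:atom id="a1" elementType="O"/>
        <cml:atom id="a2" elementType="H"/>
        <cml:atom id="a3" elementType="H"/>
      </cml:atomArray>
    </cml:molecule>
  </item>
</rdf:RDF>
"""


def molecules_in_feed(feed_xml):
    """Return (item title, list of element symbols) for each molecule in the feed."""
    root = ET.fromstring(feed_xml.strip())
    results = []
    for item in root.findall("rss:item", NS):
        title = item.findtext("rss:title", default="", namespaces=NS)
        for molecule in item.findall("cml:molecule", NS):
            elements = [atom.get("elementType")
                        for atom in molecule.iterfind(".//cml:atom", NS)]
            results.append((title, elements))
    return results


molecules = molecules_in_feed(SAMPLE_FEED)
```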

22
R, CDK & PubChem
  • Goals
  • Access cheminformatics from within R
  • Access PubChem data from within R
  • The rcdk package provides cheminformatics within
    R using CDK functionality
  • rpubchem provides access to PubChem compound data
    and bioassay data
  • Searchable via assay ID, keywords
  • J. Stat. Soft, 2007, 18(6)

23
Databases
  • Most of our databases aim to add value to PubChem
    or link into PubChem
  • We maintain a local mirror for testing, data
    mining
  • 3D structures (MMFF94)
  • Searchable by CID, SMARTS, 3D similarity
  • Docked ligands (FRED)
  • 906K drug-like compounds docked into 7 targets
  • Will eventually cover 2000 targets

24
(Cheminformatics) Algorithm Development
  • Goals
  • Focus on interpretability and applicability
  • Devise novel approaches to clustering problems
  • Investigate the utility of low dimensional
    representations for a variety of problems
  • Examples
  • Ensemble feature selection (JCIM, in press)
  • Cluster counting with R-NN curves (in revision)

25
Chemical Data Mining
  • Collaboration on screening data with Scripps, FL
  • Random forests (modeling, feature selection)
  • Naïve Bayes (modeling)
  • Identifying features indicative of toxicity
  • Domain applicability
  • NCI DTP Cell line activity predictions
  • Random forest models for 60 cell lines
  • All available as
  • downloadable R models
  • web services (supply SMILES, get prediction) with
    web page clients

26
Mining information from journal articles
  • Until now, SciFinder / CAS has been the only
    chemistry-aware portal into journal information
  • We can access full text of journal articles
    online (with subscription)
  • ACS does not make full text available but there
    are ways round that!
  • RSC is now marking up with SMILES and GO/Goldbook
    terms!
  • www.projectprospect.org
  • Having SMILES or InChI means that we can build a
    similarity/structure searchable database of
    papers, e.g. "find me all the papers published
    since 2000 which contain a structure with 90%
    similarity to this one"
  • In the absence of full text, we can at least use
    the abstract
  • OSCAR3 - Murray-Rust Group
  • A tool for shallow, chemistry-specific natural
    language parsing of chemical documents (e.g.
    journal articles).
  • It identifies (or attempts to identify):
  • Chemical names: singular nouns, plurals, verbs
    etc., also formulae and acronyms.
  • Chemical data: spectra, melting/boiling point,
    yield etc. in experimental sections.
  • Other entities: things like N(5)-C(3) and so on.
  • Part of the larger SciBorg effort
  • See http://www.cl.cam.ac.uk/aac10/escience/sciborg.html
  • http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3

27
E-Chemistry and Digital Libraries
28
E-Chemistry and Digital Libraries
  • Key problem with our SOA-based e-Science is
    information management.
  • Where is the service?
  • What does it do?
  • We may also consider our data-centric services to
    be digital libraries.
  • Data is diverse
  • Documents
  • Not just computational information like
    structures.

29
Digital Libraries
  • Open Archives Initiative Object Reuse and
    Exchange Project (OAI-ORE)
  • Developing standardized, interoperable, and
    machine-readable mechanisms to express
    information about compound information objects on
    the web.
  • Graph-based representations

30
Security
31
(No Transcript)
32
(No Transcript)
33
More Information
  • Project Web Site: www.chembiogrid.org
  • Project Wiki: www.chembiogrid.org/wiki
  • Contact me: mpierce_at_cs.indiana.edu

34
(No Transcript)
35
R Web Services
36
Why?
  • Need access to math and stat functionality
  • Did not want to recode algorithms
  • Wanted latest methods
  • Needed a distributed approach to computation
  • Keep computation on a powerful machine
  • Access it from a smaller machine

37
Why R?
  • Free, open-source
  • Many cutting edge methods available
  • Flexible programming language
  • Interfaces with many languages
  • Python
  • Perl
  • Java
  • C

38
The R Server
  • R can be run as a remote compute server
  • Requires the Rserve package
  • Allows authenticated access over TCP/IP
  • Connections can maintain state
  • Client libraries for Java and C

39
R as a Web Service
  • On its own the R server is not a web service
  • We provide Java frontends to specific
    functionalities
  • The frontend classes are hosted in a Tomcat web
    container
  • Accessible via SOAP
  • Full Javadocs for all available WSs

40
Flowchart
41
Functionality
  • Two classes of functionality
  • General functions
  • Allows you to supply data and build a predictive
    model
  • Sample from various distributions
  • Obtain scatter plots and histograms
  • Model development functions use a Java front-end
    to encapsulate model specific information

42
Functionality
  • Two classes of functionality
  • Model deployment
  • Allows you to build a model outside of the
    infrastructure
  • Place the final model in the infrastructure
  • Becomes available as a web service
  • Each model deployed requires its own front end
    class
  • In general, these classes are identical - could
    be autogenerated

43
Available Functionality
  • Predictive models - OLS, RF, CNN, LDA
  • Clustering - k-means
  • Statistical distributions
  • XY plot and scatter plots
  • Model deployment for single model types and
    ensemble model types

44
Deployed Models
  • Since deployed models are visible as web services
    we can build a simple web front end for them
  • Examples
  • NCI anti-cancer predictions
  • Ames mutagenicity predictions

45
Applications
  • The R WS is not restricted to atomic
    functionality
  • Can write a whole R program
  • Load it on the R compute server
  • Provide a Java WS frontend
  • Examples
  • Feature selection
  • Automated model generation
  • Pharmacokinetic parameter calculation

46
Data Input/Output
  • Most modeling applications require data matrices
  • Depending on client language we can use
  • SOAP array of arrays (2D matrices)
  • SOAP array (1D vector form of a 2D matrix)
  • VOTables
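The two SOAP encodings carry the same matrix. The sketch below shows one way to convert between an array of arrays and a 1D vector, assuming row-major order and that the row and column counts travel alongside the vector; the slides do not state which ordering the services use, and the function names are made up for this example.

```python
def flatten(matrix):
    """Flatten a rectangular list of rows in row-major order.

    Returns (values, n_rows, n_cols) so the receiver can rebuild the matrix.
    """
    n_rows = len(matrix)
    n_cols = len(matrix[0]) if n_rows else 0
    if any(len(row) != n_cols for row in matrix):
        raise ValueError("all rows must have the same length")
    values = [value for row in matrix for value in row]
    return values, n_rows, n_cols


def unflatten(values, n_rows, n_cols):
    """Rebuild the list of rows from a row-major 1D vector."""
    if len(values) != n_rows * n_cols:
        raise ValueError("vector length does not match the given dimensions")
    return [values[r * n_cols:(r + 1) * n_cols] for r in range(n_rows)]


descriptors = [[1.0, 2.0, 3.0],
               [4.0, 5.0, 6.0]]
vector, rows, cols = flatten(descriptors)
restored = unflatten(vector, rows, cols)
```

The 1D form is useful for client languages whose SOAP toolkits handle nested arrays poorly, at the cost of having to send the dimensions separately.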

47
Data Input/Output
  • Some R web services can take a URL to a VOTables
    document
  • Conversion to R or Java matrices is done by a
    local VOTables Java library
  • R also has basic support for VOTables directly
  • Ignores binary data streams

48
Interacting With R WSs
  • Traditional WSs do not maintain state
  • Predictive models are different
  • A model is built at one time
  • May be used for prediction at another time
  • Need to maintain state
  • State is maintained by serialization to R binary
    files on the compute server
  • Clients deal with model IDs

49
Interacting with R WSs
  • Protocol
  • Send data to model WS
  • Get back model ID
  • Get various information via model ID
  • Fitted values
  • Training statistics
  • New predictions
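The model ID protocol can be sketched as below. This is an illustrative stand-in rather than the CICC implementation: a plain Python class plays the role of the web service, a one-variable least-squares fit plays the role of the R model, and `pickle` files play the role of the R binary files on the compute server. All class, method, and file names are hypothetical.

```python
import pickle
import tempfile
import uuid
from pathlib import Path


class ModelService:
    """Keeps no models in memory; every call reloads the model by its ID."""

    def __init__(self, store_dir):
        self.store_dir = Path(store_dir)

    def _path(self, model_id):
        return self.store_dir / f"{model_id}.pkl"

    def _load(self, model_id):
        with open(self._path(model_id), "rb") as handle:
            return pickle.load(handle)

    def build_model(self, x, y):
        """Fit y = intercept + slope * x, persist the model, return its ID."""
        n = len(x)
        mean_x, mean_y = sum(x) / n, sum(y) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0:
            raise ValueError("x values must not all be equal")
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
        slope = sxy / sxx
        intercept = mean_y - slope * mean_x
        fitted = [intercept + slope * xi for xi in x]
        rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
        model = {"slope": slope, "intercept": intercept,
                 "fitted": fitted, "rss": rss}
        model_id = uuid.uuid4().hex
        with open(self._path(model_id), "wb") as handle:
            pickle.dump(model, handle)
        return model_id

    def fitted_values(self, model_id):
        return self._load(model_id)["fitted"]

    def training_statistics(self, model_id):
        return {"rss": self._load(model_id)["rss"]}

    def predict(self, model_id, new_x):
        model = self._load(model_id)
        return [model["intercept"] + model["slope"] * xi for xi in new_x]


store = tempfile.mkdtemp()
model_id = ModelService(store).build_model([1.0, 2.0, 3.0, 4.0],
                                           [3.0, 5.0, 7.0, 9.0])
# A fresh service instance can still answer, because state lives on disk.
predictions = ModelService(store).predict(model_id, [5.0])
```

Because the client only holds the ID, the model can be built in one session and queried for fitted values, training statistics, or new predictions in another.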

50
Cheminformatics at Indiana University School of
Informatics
  • David J. Wild
  • djwild_at_indiana.edu
  • Associate Director of Chemical Informatics &
    Assistant Professor
  • Indiana University School of Informatics,
    Bloomington
  • http://djwild.info

51
Cheminformatics education at Indiana
  • M.S. in Chemical Informatics
  • 2 years, 36 semester hours
  • Includes a 6-hour capstone / research project
  • Opportunity to work in Laboratory Informatics
    (IUPUI) or closely with Bioinformatics (IUB)
  • Currently 9 students enrolled
  • Ph.D. in Informatics, Cheminformatics Specialty
  • 90 credit hours, including 30 hours dissertation
    research. Usually 4 years.
  • Research rotations expose students to research in
    related areas
  • Currently 4 students enrolled
  • Graduate Certificate
  • 4 courses, all available by Distance Education
  • I571 Chemical Information Technology
  • I572 Computational Chemistry & Molecular Modeling
  • I573 Programming for Science Informatics
  • I553 Independent Study in Chemical Informatics
  • D.E. students pay in-state fees! ($800 per
    class)
  • See http://cheminfo.informatics.indiana.edu for
    more information, or a general review of
    cheminformatics education in Drug Discovery Today
    11 (9/10), May 2006, pp. 436-439

52
Distance Education for Cheminformatics
  • Uses Breeze teleconference for live sharing of
    classes; all that is required is a P.C. and a
    telephone. Optional Polycom videoconferencing.
  • Lectures are recorded for easy playback through a
    web browser
  • Wiki or similar webpage for dissemination of
    course materials
  • Also participate in CIC courseshare to give class
    at University of Michigan
  • Of 75 students taking our courses since fall
    2005, 39 have been D.E. students
  • See JCIM 2006 46(2) pp 495 - 502 for more
    details

53
Current research in the Wild lab
  • Integration of cheminformatics tools and data
    sources
  • A web service infrastructure for cheminformatics
  • Compound information aggregation web service
    and interface ("by the way" box)
  • An enhanced chatbot for exploiting chemical
    information web services
  • Semantically-aware workflow tools for
    cheminformatics
  • Data mining the NIH DTP tumor cell line database
  • PubDock: a docking database for PubChem
  • Aggregating life science information from web and
    journal documents
  • Data mining semantically rich chemistry journal
    articles
  • Document similarity based on chemical structure
    similarity
  • Evaluating semantic markup of chemistry journal
    articles
  • Integrating cheminformatics into the chemistry
    lab
  • Integrating cheminformatics with the Second Life
    virtual world
  • Integrating cheminformatics tools with electronic
    lab notebooks
  • Usability of cheminformatics tools

54
Current research in the Guha lab
  • Predictive Modeling
  • Interpretation, validation, domain applicability
  • Generalization to other models such as docking,
    pharmacophore etc
  • Integration of multiple data types
  • Addressing imbalanced and noisy datasets
  • Analysis of Chemical Spaces
  • Quantify distributions in spaces
  • Investigation of density approaches
  • Applications to lead hopping, model domains
  • Methods to summarize & compare data
  • Applications to HTS and smaller lead series type
    datasets
  • Network models combining chemical structures and
    biological systems
  • Software and infrastructure
  • Model exchange and annotation
  • Pharmacophore representations, matching
  • Toolkit development (CDK)

55
Cheminformatics web service infrastructure
  • Cheminformatics services
  • Docking (FRED)
  • 3D structure generation (OMEGA)
  • Filtering (FRED, etc)
  • OSCAR3
  • Fingerprints (BCI, CDK)
  • Clustering (BCI)
  • Toxicity prediction (ToxTree)
  • R-based predictive models
  • Similarity calculations (CDK)
  • Descriptor calculation (CDK)
  • 2D structure diagrams (CDK)
  • Database Services
  • PostgreSQL / gNova
  • PubChem mirror (augmented)
  • Pub3D - 3D structures for PubChem
  • PubDock - Bound 3D structures
  • Compound-indexed journal article DB
  • NIH Human Tumor Cell Line
  • Local PubChem mirror
  • VARUNA quantum chemistry database
  • Statistics (based on R)
  • Regression, LDA
  • Neural Nets, Random Forest
  • K-means clustering
  • Plotting
  • T-test and distribution sampling

Xiao Dong, Kevin E. Gilbert, Rajarshi Guha, Randy
Heiland, Jungkee Kim, Marlon E. Pierce, Geoffrey
C. Fox and David J. Wild, Web service
infrastructure for chemoinformatics, Journal of
Chemical Information and Modeling, 2007 47(4) pp
1303-1307
56
Tools and mashups based on web service
infrastructure
http://www.chembiogrid.org/projects/proj_tools.html
57
PubDock - database of docked PubChem Ligands
  • 1 million PubChem compounds (druggable) docked
    into PDB proteins (currently 7 but more coming -
    hope to have 100 or so)
  • Multiple interfaces. This is really a
    bioinformatics / chemoinformatics mashup
  • Retrieve top hits for a protein
  • Organize proteins by similarity between docking
    profiles over compounds
  • Cluster compounds by docking profile across
    targets
  • Uses many web services: PDB services, our PubDock
    database service, our CDK services etc.

58
Mashup: What published compounds might bind to
this protein?
  • Create a database containing the text of all
    recent PubMed abstracts (2006-2007, 500,000).
  • Use OSCAR to extract all of the chemical names
    referred to in the abstracts and convert them to
    SMILES.
  • Convert molecules to 3D and dock them into a
    protein of interest.
  • Visualize top docked molecules in a Google-like
    interface.
Diagram labels: DATABASE SERVICE, DOCKING SERVICE
59
RSC Project Prospect - what can we do with the
information?
  • www.projectprospect.org
  • 100 papers marked up with SMILES/InChI (using
    OSCAR3), plus Gene Ontology and Goldbook Ontology
    terms
  • Created similarity searchable PostgreSQL / gNova
    database with paper DOIs, SMILES, and ontology
    terms
  • Web service and simple HTML interfaces for
    searching, e.g. "which papers reference compounds
    similar to this one in the scope of these
    ontological terms?"
  • Applying statistics to look at co-occurrence of
    compounds, structural features (MACCS keys) and
    ontological terms in papers

60
Greasemonkey / OSCAR script
http://cheminfo.informatics.indiana.edu:8080/ChemGM/index.jsp
61
"By the way" annotation (mock-up!)
By the way...
  • This compound is very similar to a prescription
    drug, Tamoxifen.
  • This compound is referenced in 20 journal articles
    published in the last 5 years.
  • Similar compounds are associated with the words
    "toxic" and "death" in 280 web pages.
  • It appears to be covered under 3 patents.
  • It has been shown to be active in 5 screens.
  • Computer models predict it to show some activity
    against 8 protein targets.
  • Here are some comments on this compound:
    David Wild: don't take any notice of the
    computational models - they are rubbish
62
Cheminformatics-aware simple lab notebook
(mock-up!)
  • Plug-in allows structures to be drawn with the
    pen and cleaned up.
  • Sample notebook text: "Some useful chemical
    reactions: Iodoacetate & Iodoacetamide
    (I-CH2COO-, ICH2CONH2). This may also react,
    chem favored by alkaline pH."
  • Web service interface provides access to
    computation and searching. Page is marked up by
    what is possible (e.g. FIND INFO ABOUT THIS
    REACTION).
  • Free text input can be converted to machine
    readable form by Electrovaya.
  • Automatic detection of data fields (yield, etc.)
    where possible.
63
Automatic workflow generation and natural
language queries
  • Develop service ontology using OWL-S or similar
    language
  • Allows service interoperability, replacement and
    input/output compatibility
  • We can then use generic reasoning and network
    analysis tools to find paths from inputs to
    desired outputs
  • Natural language can be parsed to inputs and
    desired outputs
  • Smart Clients Agents Services
  • Possible supercharged life science Google? -
    e.g. type in "what compounds might bind to the
    enclosed protein?"
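Finding a path from inputs to desired outputs can be sketched as a breadth-first search over service descriptions. The example below is a simplification under stated assumptions: each service has exactly one input type and one output type, whereas a real OWL-S description would be richer, and the service and type names are invented for illustration.

```python
from collections import deque

# Hypothetical service registry: name -> (input type, output type).
SERVICES = {
    "name_to_2d": ("chemical_name", "structure_2d"),
    "similarity_2d": ("structure_2d", "structure_2d"),
    "convert_2d_to_3d": ("structure_2d", "structure_3d"),
    "pharmacophore_search": ("structure_3d", "structure_3d"),
    "dock": ("structure_3d", "docked_complex"),
}


def find_workflow(start_type, goal_type, services):
    """Return the shortest chain of service names from start to goal, or None."""
    queue = deque([(start_type, [])])
    seen = {start_type}
    while queue:
        current_type, path = queue.popleft()
        if current_type == goal_type:
            return path
        for name, (in_type, out_type) in services.items():
            if in_type == current_type and out_type not in seen:
                seen.add(out_type)
                queue.append((out_type, path + [name]))
    return None


workflow = find_workflow("chemical_name", "docked_complex", SERVICES)
```

A natural language front end would then only need to map a query onto a start type and a goal type, leaving the chaining to the search.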

(Figure: service graph in which data types such as
2D structures, 3D structures, 3D protein structures
and 3D complexes are linked by services including 2D
similarity, 2D-to-3D conversion, 2D structure
crawler, 3D search, pharmacophore search and
docking.)