Title: Grids/CI for Scholarly Research and application to Chemical Informatics
1 Grids/CI for Scholarly Research and application to Chemical Informatics
- HPC 2006 in Cetraro Italy
- July 4 2006
- Geoffrey Fox
- Computer Science, Informatics, Physics
- Pervasive Technology Laboratories
- Indiana University Bloomington IN 47401
- gcf_at_indiana.edu
- http://www.infomall.org
2 Motivation
- Build Cyberinfrastructure (Grids) that:
- Supports science from beginning (planning, instruments) through middle (analysis) and end (refereed publications, follow-on work)
- Integrates with the popular Web 2.0 (community) tools whose successes point to interesting ways of working together
- Integrates with Digital Library technology
- Does not redo previous work but rather augments it
- Assumes a heterogeneous fragmented world with multiple platforms
- Allows one to specify and manage all the services and data that a project needs with a mix of synchronous, asynchronous, close (classic workflow) and loose (including zero) coupling
3 Application Drivers
- Chemical Informatics, as this has very precise naming rules for compounds that allow accurate searches in documents
- Suggesting how to tag scientific documents either when writing them or after the fact
- Global Information Grid (Military Net-Centric systems), as these inevitably need Grid of Grids to support systems of systems
- Journal web site of the future, as illustrated by Nature building social bookmarking tool Connotea
- Conference support tools, as these can benefit from features needed by journals
4 The Science Drivers
- From Workshop on Challenges of Scientific Workflows http://vtcpc.isi.edu/wiki/index.php/Main_Page
- Workflow is underlying support for current science model
- Distributed, interdisciplinary, data-deluged scientific methodology as an end (instrument, conjecture) to end (paper, Nobel prize) process is a transformative approach
- Reproducibility is core to scientific method and requires rich provenance, interoperable persistent repositories with linkage of open data and publication as well as distributed simulations, data analysis and new algorithms
- Distributed Science Methodology publishes all steps in a new electronic logbook capturing scientific process (data analysis) as a rich cloud of resources including emails, PPT, Wikis as well as databases, compiler options, build time/runtime configuration
5 Community (? VO) Tools
- e-mail and list-serves are oldest and best used
- Kazaa, Instant Messengers, Skype, Napster, BitTorrent for P2P Collaboration: text, audio-video conferencing, files
- del.icio.us, Connotea, Citeulike, Bibsonomy, Biolicious manage shared bookmarks (later)
- http://en.wikipedia.org/wiki/Category:Social_bookmarking
- MySpace, Bebo, Hotornot, Facebook, or similar sites allow you to create (upload) community resources and share them; Friendster, LinkedIn create networks
- http://en.wikipedia.org/wiki/Category:Social_networking
- http://en.wikipedia.org/wiki/List_of_social_networking_websites
- Writely, Wikis and Blogs are powerful specialized shared document systems
- ConferenceXP and WebEx share general applications
- Google Scholar (Citeseer) tells you who has cited your papers while publisher sites tell you about co-authors
- Windows Live Academic Search has similar goals (later)
- Note sharing resources creates (implicit) communities
- Social network tools study graphs to both define communities and extract their properties
6 How to use Web 2.0 Community tools in CI
- Nearly all of them have profiles, users, groups, friends etc.
- Need to integrate these
- P2P File Sharing: maybe this is useful for sharing files in research groups (virtual organizations)
- Will modify Maze http://maze.pku.edu.cn, a popular Chinese social P2P system with 2.5 million users
- BitTorrent is more popular than FTP; why not use it for higher performance, fault tolerant, cached file sharing?
- MySpace etc.: could consider MyK-12ScienceSpace or MyGridSpace that supports a similar document sharing model with users uploading pictures, papers and even data/services of interest
- Could include uploaded material in workflows
- Can impose different policies
- Social Bookmarking and linking: discuss later
- http://gf6.ucs.indiana.edu:48990/SemanticResearchGrid/
7 Integration Framework of Tools
(Figure. Diagram labels: Native UI-1, Native UI-3, Native UI-4, Native UI-N; Gateway WS-1, Gateway WS-2, Gateway WS-3, Gateway WS-N; SSG MDStore; Integrated User Interface UI; SSG = Semantic Scholars Grid)
8 Strategy
- Doesn't seem useful to build the 251st community tool
- In fact a major barrier to use of existing tools is:
- What happens when a better tool comes along and/or the chosen tool disappears (unsupported/removed from Web)?
- So assume use of existing tools but wrap them all as web services so one can transfer information to new tools and integrate information between tools
- Need some glue logic, a unification database and minimal user interface
- Bookmarking tools: del.icio.us, Connotea, CiteULike (includes plug-ins to major publisher sites)
- Document: Google Scholar, Windows Live, Citeseer tools, OSCAR3 for Chemistry (later), Science.gov
- Journals: Manuscript Central
- Conferences: CMT from Microsoft or ?
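The wrapping strategy above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `ToolGateway`, `InMemoryTool`, `Bookmark` and `migrate` are hypothetical names, and the in-memory class stands in for a real gateway that would call a tool's web API. It shows why a common wrapper makes it cheap to move data when a better tool comes along or a chosen tool disappears.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Bookmark:
    """Tool-neutral record, as it might be held in the unification database."""
    url: str
    title: str
    tags: frozenset = field(default_factory=frozenset)


class ToolGateway(ABC):
    """Hypothetical wrapper that exposes one community tool as a service."""

    @abstractmethod
    def export_bookmarks(self) -> list: ...

    @abstractmethod
    def import_bookmark(self, bookmark: Bookmark) -> None: ...


class InMemoryTool(ToolGateway):
    """Stand-in for a real tool; a real gateway would call the tool's web API."""

    def __init__(self, name):
        self.name = name
        self._items = {}

    def export_bookmarks(self):
        return list(self._items.values())

    def import_bookmark(self, bookmark):
        # Keyed by URL so re-importing the same bookmark does not duplicate it.
        self._items[bookmark.url] = bookmark


def migrate(source: ToolGateway, target: ToolGateway) -> int:
    """Glue logic: copy every bookmark from one wrapped tool to another."""
    count = 0
    for bookmark in source.export_bookmarks():
        target.import_bookmark(bookmark)
        count += 1
    return count


old_tool = InMemoryTool("old-tool")
new_tool = InMemoryTool("new-tool")
old_tool.import_bookmark(
    Bookmark("http://www.connotea.org", "Connotea", frozenset({"bookmarking"}))
)
moved = migrate(old_tool, new_tool)
```

Because the glue logic only sees the `ToolGateway` interface, supporting a new tool means writing one more wrapper rather than touching the migration code.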
9 Connotea
10 Connotea queried by SERVOGrid
11 Delicious Semantic Web/Grid
- http://del.icio.us purchased by Yahoo for $30M
- http://www.CiteULike.org
- http://www.connotea.org (Nature)
- Associate metadata with Bookmarks specified by URLs, DOIs (Digital Object Identifiers)
- Users add comments and keywords (called tags)
- Users are linked together into groups (communities)
- Information such as title and authors extracted automatically from some sites (PubMed, ACM, IEEE, Wiley etc.)
- Bibtex like additional information in CiteULike
- This is perhaps de facto Semantic Web, remarkable for its simplicity
12 Document-enhanced Cyberinfrastructure aka Semantic Scholar Grid I
- Citeseer and Google Scholar scour the Internet and analyze documents for incidental metadata
- Title, author and institution of documents
- Citations with their own metadata allowing one to match to other documents
- Science.gov extracts metadata from lots of US Government databases
- These capabilities are sure to become more powerful and to be extended
- Give Citation Index in real time
- Tell you all authors of all papers that cite a paper that cites you etc. (Note it's a small world so don't go too far in link analysis)
- Tell you all citations of all papers in a workshop
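The "authors of papers that cite a paper that cites you" query is a breadth-first walk over the citation graph with a depth cap. A minimal sketch follows, assuming a toy in-memory graph; `CITED_BY`, `AUTHORS` and `citing_authors` are hypothetical names and the data is invented for illustration only.

```python
from collections import deque

# Hypothetical toy data: paper -> papers that cite it, and paper -> authors.
CITED_BY = {
    "P0": ["P1", "P2"],
    "P1": ["P3"],
    "P2": [],
    "P3": ["P4"],
    "P4": [],
}
AUTHORS = {
    "P0": ["You"],
    "P1": ["Ada"],
    "P2": ["Bob", "Ada"],
    "P3": ["Cy"],
    "P4": ["Di"],
}


def citing_authors(paper, max_depth):
    """Authors of papers reachable from `paper` within max_depth citation hops."""
    seen = {paper}
    queue = deque([(paper, 0)])
    authors = set()
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth cap: in a small world, deeper hops reach nearly everyone
        for citer in CITED_BY.get(current, []):
            if citer not in seen:
                seen.add(citer)
                authors.update(AUTHORS.get(citer, []))
                queue.append((citer, depth + 1))
    return sorted(authors)


one_hop = citing_authors("P0", 1)
two_hops = citing_authors("P0", 2)
```

Here `one_hop` covers direct citers and `two_hops` adds the papers that cite those; the `max_depth` parameter is the knob that keeps the small-world effect from swamping the result.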
13 Document-enhanced Cyberinfrastructure aka Semantic Scholar Grid II
- It is natural to develop core document Services such as those used in Citeseer/Google Scholar but applied to your documents of interest that may not have been processed yet
- As just submitted to a conference perhaps
- These tools can help form useful lists such as authors of all cited or submitted papers to a journal
- OSCAR2/3 (from Peter Murray-Rust's group at Cambridge) augment the application independent core metadata (Title, authors, institutions, Citations) with a list of all chemical terms
- This tool is a Service that can be applied to your document or to a set of documents harvested in some fashion
- Other fields have natural application specific metadata and OSCAR like tools can be developed for them
- Such high value tools could appear on publisher sites of future (or else publishers will disappear)
14 Document-enhanced Cyberinfrastructure
(Figure. Diagram labels: Traditional Cyberinfrastructure; Community Tools; Generic Document Tools; Existing User Interface; Integration/Enhancement User Interface; Existing Document-based Research Tools; New Document-enhanced Research Tools; Export: RSS, Bibtex, Endnote etc. Tools shown: Del.icio.us, CiteULike, Connotea, Bibsonomy, Biolicious, Google Scholar, Citeseer, Windows Live Academic Search, Science.gov, PubChem, PubMed, CMT Conference Management, Manuscript Central, etc.)
15 Chemical Informatics as a Grid Application
- Chemical Informatics is the application of information technology to problems in chemistry.
- Example problems: managing data in large scale drug discovery and molecular modeling
- Building Blocks: Chemical Informatics Resources
- Chemical databases maintained by various groups
- NIH PubChem, NIH DTP, http://nihroadmap.nih.gov/
- Application codes (both commercial and open source)
- Data mining such as clustering
- Quantum chemistry and molecular modeling
- Screening centers (with HTS, High Throughput Screening, devices) measuring interaction of chemicals with biological samples
- Visualization tools
- Web resources: journal articles, etc.
- Chemical Informatics Grid http://www.chembiogrid.org needs to integrate these into a common, loosely coupled, distributed computing environment.
16 Document, Simulation and Data rich CI for Chemical Informatics
SCIENTIST: These compounds look promising from their HTS results. Should I commit some chemistry resources to following them up?
17 HTS results and COMPARE Web service
Positive results (red bar to right of vertical line) indicate greater than average toxicity of the tested agent to the cell line.
http://dtp.nci.nih.gov/docs/compare/compare.html
18 HTS data organization and flagging
A tumor cell line is selected. The activity results for all the compounds in the DTP database in the given range are extracted from the PostgreSQL database.
OpenEye FILTER is used to calculate biological and chemical properties of the compounds that are related to their potential effectiveness as drugs.
The compounds are clustered on chemical structure similarity, to group similar compounds together.
The compounds along with property and cluster information are converted to VOTable format and displayed in VOPlot.
Use Taverna for Workflow and VOTable (from astronomy) as basic data structure: VOTable of compounds and properties with Excel-like spreadsheet services.
19 Varuna environment for molecular modeling (Baik, IU)
(Figure. Diagram labels: Researcher; Chemical Concepts; Papers etc.; Experiments; ChemBioGrid; Simulation Service: FORTRAN Code, Scripts; DB Service: Queries, Clustering, Curation, etc.; QM Database; QM/MM Database; ReactionDB; PubChem, PDB, NCI, etc.; Condor; Supercomputer)
20 OSCAR3 Service from Cambridge UK
- Oscar3 is a tool for shallow, chemistry-specific natural language parsing of chemical documents (i.e. journal articles).
- It identifies (or attempts to identify):
- Chemical names: singular nouns, plurals, verbs etc., also formulae and acronyms.
- Chemical data: Spectra, melting/boiling point, yield etc. in experimental sections.
- Other entities: Things like N(5)-C(3) and so on.
- Uses SMILES, InChI and CML
- There is a larger effort, SciBorg, in this area
- http://www.cl.cam.ac.uk/aac10/escience/sciborg.html
- http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3
21 OSCAR2 Chemistry Document analysis
- It detects magic chemical strings in text and then
- Stores them as metadata associated with document
- Queries ChemInformatics repositories to tell you lots of information about identified compounds
- Tells you which other documents have this compound
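The detect, store and cross-reference steps above can be sketched as a small inverted index. This is not OSCAR's method: the fixed vocabulary `KNOWN_COMPOUNDS` is a hypothetical stand-in for its chemistry-aware parsing, the function names are invented, and the repository query step is left out.

```python
import re
from collections import defaultdict

# Hypothetical vocabulary; OSCAR itself uses far richer chemistry-aware parsing.
KNOWN_COMPOUNDS = {"benzene", "toluene", "ethanol", "acetone"}


def extract_compounds(text):
    """Return the set of known compound names mentioned in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w in KNOWN_COMPOUNDS}


def build_index(documents):
    """Map each compound to the ids of the documents that mention it."""
    metadata = {}
    index = defaultdict(set)
    for doc_id, text in documents.items():
        compounds = extract_compounds(text)
        metadata[doc_id] = compounds  # stored as metadata associated with the document
        for compound in compounds:
            index[compound].add(doc_id)
    return metadata, index


def related_documents(doc_id, metadata, index):
    """Other documents sharing at least one compound with doc_id."""
    related = set()
    for compound in metadata[doc_id]:
        related |= index[compound]
    related.discard(doc_id)
    return sorted(related)


docs = {
    "d1": "Toluene was distilled and mixed with ethanol.",
    "d2": "The sample was washed with acetone.",
    "d3": "Ethanol served as the solvent.",
}
metadata, index = build_index(docs)
```

Calling `related_documents("d1", metadata, index)` answers the "which other documents have this compound" question for the toy collection.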
22 Clustering Documents from chemical properties
23 Provenance and Delicious CI
- We can use del.icio.us style interface to annotate Application Data with (extra) provenance and user comments of any type (describing quality of data or a keyword relating different data etc.)
- All data should be labeled by a URI to enable this
- One has in addition Citeseer/OSCAR metadata
- Current major tagging systems support flat list of tags without name-value (RDF triple) or schema organization
- RDF Triples << Full Semantic Web
- Delicious << RDF
- Tradeoff between features and pervasive deployment
- Some extra features are easy to add as a custom service
- Features not supported by del.icio.us can be uploaded as comments
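One way a custom service might layer name-value structure on a flat tag list is to pack each pair into a single tag. The sketch below assumes a `name=value` convention and whitespace-free tags; the separator, the function names and the example URI are all hypothetical, not features of del.icio.us.

```python
def triple_to_tag(name, value, sep="="):
    """Encode a (name, value) pair as one flat tag, e.g. 'quality=high'."""
    if sep in name or " " in name or " " in value:
        raise ValueError("name and value must not contain spaces or the separator")
    return f"{name}{sep}{value}"


def tag_to_triple(uri, tag, sep="="):
    """Decode a flat tag into (subject, predicate, object).

    Plain tags without the separator get the generic predicate 'tag'.
    """
    if sep in tag:
        name, value = tag.split(sep, 1)
        return (uri, name, value)
    return (uri, "tag", tag)


data_uri = "http://example.org/data/run42"  # hypothetical URI labeling a dataset
tags = [
    "chemistry",
    triple_to_tag("quality", "high"),
    triple_to_tag("instrument", "HTS"),
]
triples = [tag_to_triple(data_uri, t) for t in tags]
```

The tagging system still sees an ordinary flat list, so the approach keeps the pervasive deployment while recovering a little of the RDF-style structure on the client side.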
24 Current Status
- Google Scholar, Windows Live Academic Search, del.icio.us, Connotea, CiteULike, OSCAR3 are Web Services
- Debugging on 500 presentations and papers from my CGL research group
- Experiment with GGF Presentations, Broad collection of Chemical Informatics resources (explore science document CI link) and Concurrency and Computation: Practice and Experience Web site (?business model for journals)
25 Collection (Grid) Builder Tool
- This can perhaps be built on top of workflow systems
- Unlike ordinary workflow, this is a tool to manage collections of Grids and the key metadata adorning Grids and Services
- It instantiates needed mediation between Grids (systems) to convert
- JMS to MQSeries
- GT4 to WS-I
- WS-Eventing to WS-Notification
- It supports conventional workflow as tightly coupled services
- It supports system wide management (configuration)
- We are using WS-Management, see CLADE paper
- Deploy services and mediation brokers on demand to deliver real-time performance
- DoD can't pause the battle while WS-RM and TCP catch up if data saturated
26 Grids of Grids of Simple Services
- Grids are managed collections of one or more services
- A simple service is the smallest Grid
- Services and Grids are linked by messages
- Internally to service, functionalities are linked by methods
- Link services via methods → messages → streams
- We are familiar with method-linked hierarchy: Lines of Code → Methods → Objects → Programs → Packages
27 Component Grids?
- So we build collections of Web Services which we package as component Grids
- Visualization Grid
- Sensor Grid
- Utility Computing Grid
- Collaboration Grid
- Earthquake Simulation Grid
- Control Room Grid
- Crisis Management Grid
- Drug Discovery Grid
- Bioinformatics Sequence Analysis Grid
- Intelligence Data-mining Grid
- We build bigger Grids by composing component Grids
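The idea that a simple service is the smallest Grid, and that bigger Grids are composed from component Grids, maps naturally onto a composite structure. A minimal sketch, assuming hypothetical `Service` and `Grid` classes and made-up service names:

```python
class Service:
    """A simple service: the smallest Grid."""

    def __init__(self, name):
        self.name = name

    def services(self):
        return [self.name]


class Grid:
    """A managed collection of one or more services and/or other Grids."""

    def __init__(self, name, members):
        if not members:
            raise ValueError("a Grid needs one or more members")
        self.name = name
        self.members = list(members)

    def services(self):
        # Flatten recursively: a member may be a Service or a nested Grid.
        found = []
        for member in self.members:
            found.extend(member.services())
        return found


sensor_grid = Grid("Sensor Grid", [Service("seismometer-feed"), Service("gps-feed")])
visualization_grid = Grid("Visualization Grid", [Service("map-render")])
crisis_grid = Grid(
    "Crisis Management Grid",
    [sensor_grid, visualization_grid, Service("alerting")],
)
```

Because `Service` and `Grid` answer the same `services()` call, management code can treat a single service and a Grid of Grids uniformly.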
28 Mediation and Transformation in a Grid of Grids and Simple Services
(Figure. Diagram labels:)
- Mediation and Transformation Services: Distributed Brokers between distributed ports
- Mediation and Transformation Services: Listen, Queue, Transform, Send
- External facing Interfaces
- Mediation and Transformation Services: 1-10 ms Overhead
- Use OGSA to Federate?
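The Listen, Queue, Transform, Send cycle of a mediation service can be sketched in a single process. This is an illustration only: `MediationBroker` and `source_to_target` are hypothetical names, the message fields are invented, and a real broker would sit between distributed ports rather than Python callables.

```python
import queue


class MediationBroker:
    """Listen -> Queue -> Transform -> Send, in one process for illustration."""

    def __init__(self, transform, send):
        self._inbox = queue.Queue()
        self._transform = transform
        self._send = send

    def listen(self, message):
        """Accept a message from the source side and queue it."""
        self._inbox.put(message)

    def pump(self):
        """Drain the queue, transforming and forwarding each message in order."""
        forwarded = 0
        while not self._inbox.empty():
            message = self._inbox.get()
            self._send(self._transform(message))
            forwarded += 1
        return forwarded


def source_to_target(message):
    """Hypothetical transform: rename fields between two message styles."""
    return {"destination": message["topic"], "payload": message["body"]}


delivered = []
broker = MediationBroker(source_to_target, delivered.append)
broker.listen({"topic": "sensors/1", "body": "42"})
broker.listen({"topic": "sensors/2", "body": "17"})
count = broker.pump()
```

Swapping the `transform` callable is what would distinguish one conversion from another; the queue in the middle is where per-message mediation overhead accumulates.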
29 Four cores give 3000 messages per second: about one message per millisecond per core for Opteron, and one message per 2 ms for a Sun Niagara core
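The per-core figure follows from simple division, assuming the 3000 messages per second are shared evenly across the four cores. The helper name below is hypothetical.

```python
def per_core_service_time_ms(total_msgs_per_second, cores):
    """Milliseconds of core time available per message at the given throughput."""
    per_core_rate = total_msgs_per_second / cores  # messages per second per core
    return 1000.0 / per_core_rate


opteron_ms = per_core_service_time_ms(3000, 4)
```

Under that even-sharing assumption each core handles 750 messages per second, or roughly 1.3 ms per message, which the slide rounds to about one message per millisecond.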
30 Pentium 4 (3.4 GHz) with 1 GB of RAM, with IBM MQ Series, NaradaBrokering and the Message Bridge all running on it. NaradaBrokering running in JMS emulation mode.

Message Size   NaradaBrokering (JMS) to IBM MQ     IBM MQ to NaradaBrokering (JMS)
               In-order       No ordering          In-order       No ordering
               (msgs/s)       (msgs/s)             (msgs/s)       (msgs/s)
100 Bytes      350            530                  320            310
1 KByte        330            500                  290            290
4 KBytes       200            390                  220            210
31 Raw Data → Data → Information → Knowledge → Wisdom
(Figure. Legend: SS = Sensor Service, FS = Filter Service, OS = Other Service, MD = MetaData. Other diagram labels: Another Grid, Another Service, Portal, SOAP Messages, Decisions)