Grids/CI for Scholarly Research and application to Chemical Informatics

1
Grids/CI for Scholarly Research and application
to Chemical Informatics
  • HPC 2006 in Cetraro Italy
  • July 4 2006
  • Geoffrey Fox
  • Computer Science, Informatics, Physics
  • Pervasive Technology Laboratories
  • Indiana University Bloomington IN 47401
  • gcf_at_indiana.edu
  • http://www.infomall.org

2
Motivation
  • Build Cyberinfrastructure (Grids) that
  • Supports science from beginning (planning,
    instruments) through middle (analysis) and end
    (refereed publications, follow-on work)
  • Integrates with the popular Web 2.0 (community)
    tools whose successes point to interesting ways
    of working together
  • Integrates with Digital Library technology
  • Does not redo previous work but rather augments
    it
  • Assumes a heterogeneous fragmented world with
    multiple platforms
  • Allows one to specify and manage all the services
    and data that a project needs with a mix of
    synchronous, asynchronous, close (classic
    workflow) and loose (including zero) coupling

3
Application Drivers
  • Chemical Informatics as this has very precise
    naming rules for compounds that allow accurate
    searches in documents
  • Suggesting how to tag scientific documents either
    when writing them or after the fact
  • Global Information Grid (Military Net-Centric
    systems) as these inevitably need Grid of Grids
    to support systems of systems
  • Journal web site of the future as illustrated by
    Nature building social bookmarking tool Connotea
  • Conference support tools as these can benefit
    from features needed by journals

4
The Science Drivers
  • From Workshop on Challenges of Scientific
    Workflows http://vtcpc.isi.edu/wiki/index.php/Main_Page
  • Workflow is underlying support for current
    science model
  • Distributed, interdisciplinary, data-deluged
    scientific methodology as an end (instrument,
    conjecture) to end (paper, Nobel prize) process
    is a transformative approach
  • Reproducibility is core to scientific method and
    requires rich provenance, interoperable
    persistent repositories with linkage of open data
    and publication as well as distributed
    simulations, data analysis and new algorithms.
  • Distributed Science Methodology publishes all
    steps in a new electronic logbook capturing
    scientific process (data analysis) as a rich
    cloud of resources including emails, PPT, Wikis
    as well as databases, compiler options, build
    time/runtime configuration

5
Community (? VO) Tools
  • e-mail and list-serves are oldest and best used
  • Kazaa, Instant Messengers, Skype, Napster,
    BitTorrent for P2P Collaboration: text,
    audio-video conferencing, files
  • del.icio.us, Connotea, Citeulike, Bibsonomy,
    Biolicious manage shared bookmarks (later)
  • http://en.wikipedia.org/wiki/Category:Social_bookmarking
  • MySpace, Bebo, Hotornot, Facebook, or similar
    sites allow you to create (upload) community
    resources and share them; Friendster, LinkedIn
    create networks
  • http://en.wikipedia.org/wiki/Category:Social_networking
  • http://en.wikipedia.org/wiki/List_of_social_networking_websites
  • Writely, Wikis and Blogs are powerful specialized
    shared document systems
  • ConferenceXP and WebEx share general applications
  • Google Scholar (Citeseer) tells you who has cited
    your papers while publisher sites tell you about
    co-authors
  • Windows Live Academic Search has similar goals
    (later)
  • Note: sharing resources creates (implicit)
    communities
  • Social network tools study graphs to both define
    communities and extract their properties

6
How to use Web2.0 Community tools in CI
  • Nearly all of them have profiles, users,
    groups, friends etc.
  • Need to integrate these
  • P2P File Sharing: maybe this is useful for
    sharing files in research groups (virtual
    organizations)
  • Will modify Maze (http://maze.pku.edu.cn), a
    popular Chinese social P2P system with 2.5
    million users
  • BitTorrent is more popular than FTP: why not use
    it for higher performance, fault tolerant, cached
    file sharing?
  • MySpace etc.: could consider MyK-12ScienceSpace
    or MyGridSpace that supports a similar document
    sharing model with users uploading pictures,
    papers and even data/services of interest
  • Could include uploaded material in workflows
  • Can impose different policies
  • Social Bookmarking and linking: discussed later
  • http://gf6.ucs.indiana.edu:48990/SemanticResearchGrid/

7
Integration Framework of Tools
[Figure labels: Native UI-1, ..., Native UI-N; Gateway WS-1, ..., Gateway WS-N; SSG MDStore; Integrated User Interface UI; SSG = Semantic Scholars Grid]
8
Strategy
  • Doesn't seem useful to build the 251st community
    tool
  • In fact a major barrier to use of existing tools
    is
  • What happens when a better tool comes along
    and/or chosen tool disappears (unsupported/removed
    from Web)?
  • So assume we use existing tools but wrap them all
    as web services so we can transfer information to
    new tools and integrate information between tools
  • Need some glue logic, a unification database
    and minimal user interface
  • Bookmarking tools: del.icio.us, Connotea,
    CiteULike (includes plug-ins to major publisher
    sites)
  • Documents: Google Scholar, Windows Live, Citeseer
    tools, OSCAR3 for Chemistry (later), Science.gov
  • Journals: Manuscript Central
  • Conferences: CMT from Microsoft or ?
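
The wrap-and-unify strategy above can be sketched in a few lines. This is only an illustration under stated assumptions: `Bookmark`, `InMemoryTool` and `UnificationStore` are hypothetical names, and the in-memory tool stands in for a real web service wrapper around del.icio.us, Connotea or CiteULike (no real API is called).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Bookmark:
    url: str
    title: str
    tags: Tuple[str, ...] = ()


class InMemoryTool:
    """Hypothetical stand-in for one wrapped community tool (no network calls)."""

    def __init__(self, name: str, items: List[Bookmark] = None):
        self.name = name
        self._items = list(items or [])

    def export(self) -> List[Bookmark]:
        return list(self._items)

    def import_(self, items: List[Bookmark]) -> None:
        # Skip URLs the tool already holds.
        known = {b.url for b in self._items}
        self._items.extend(b for b in items if b.url not in known)


class UnificationStore:
    """Glue logic: merge records from every wrapped tool, keyed by URL."""

    def __init__(self):
        self.records: Dict[str, Bookmark] = {}

    def harvest(self, tool: InMemoryTool) -> None:
        for b in tool.export():
            old = self.records.get(b.url)
            merged = set(b.tags) | set(old.tags if old else ())
            self.records[b.url] = Bookmark(b.url, b.title, tuple(sorted(merged)))

    def migrate_to(self, tool: InMemoryTool) -> None:
        # Move everything to a new tool when a better one comes along.
        tool.import_(list(self.records.values()))
```

Because every tool sits behind the same `export`/`import_` pair, a tool that disappears costs only its wrapper; the harvested records survive in the store.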

9
Connotea
10
Connotea queried by SERVOGrid
11
Delicious Semantic Web/Grid
  • http://del.icio.us purchased by Yahoo for $30M
  • http://www.CiteULike.org
  • http://www.connotea.org (Nature)
  • Associate metadata with Bookmarks specified by
    URLs, DOIs (Digital Object Identifiers)
  • Users add comments and keywords (called tags)
  • Users are linked together into groups
    (communities)
  • Information such as title and authors extracted
    automatically from some sites (PubMed, ACM, IEEE,
    Wiley etc.)
  • BibTeX-like additional information in CiteULike
  • This is perhaps the de facto Semantic Web:
    remarkable for its simplicity

12
Document-enhanced Cyberinfrastructure aka
Semantic Scholar Grid I
  • Citeseer and Google Scholar scour the Internet
    and analyze documents for incidental metadata
  • Title, author and institution of documents
  • Citations with their own metadata allowing one to
    match to other documents
  • Science.gov extracts metadata from lots of US
    Government databases
  • These capabilities are sure to become more
    powerful and to be extended
  • Give Citation Index in real time
  • Tell you all authors of all papers that cite a
    paper that cites you etc. (Note it's a small
    world so don't go too far in link analysis)
  • Tell you all citations of all papers in a
    workshop
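
The "authors of all papers that cite a paper that cites you" query is a bounded walk over the citation graph. The sketch below assumes the graph is already available as plain dictionaries (`cited_by` and `authors` are hypothetical inputs, not the output format of Citeseer or Google Scholar); `max_depth` implements the small-world caution about not following links too far.

```python
from collections import deque


def citing_authors(cited_by, authors, start, max_depth=2):
    """Collect authors of papers within max_depth citation hops of `start`.

    cited_by maps a paper id to the ids of the papers that cite it;
    authors maps a paper id to its author names.
    """
    seen = {start}
    frontier = deque([(start, 0)])
    found = set()
    while frontier:
        paper, depth = frontier.popleft()
        if depth == max_depth:
            continue  # small world: stop expanding beyond the depth limit
        for citer in cited_by.get(paper, ()):
            if citer not in seen:
                seen.add(citer)
                found.update(authors.get(citer, ()))
                frontier.append((citer, depth + 1))
    return found
```

With `max_depth=1` this returns only the authors who cite you directly; each extra hop widens the set quickly, which is why the limit matters.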

13
Document-enhanced Cyberinfrastructure aka
Semantic Scholar Grid II
  • It is natural to develop core document Services
    such as those used in Citeseer/Google Scholar but
    applied to your documents of interest that may
    not have been processed yet
  • As just submitted to a conference perhaps
  • These tools can help form useful lists such as
    authors of all cited or submitted papers to a
    journal
  • OSCAR2/3 (from Peter Murray-Rust's group at
    Cambridge) augment the application independent
    core metadata (Title, authors, institutions,
    Citations) with a list of all chemical terms
  • This tool is a Service that can be applied to
    your document or to a set of documents
    harvested in some fashion
  • Other fields have natural application specific
    metadata and OSCAR like tools can be developed
    for them
  • Such high value tools could appear on publisher
    sites of the future (or else publishers will
    disappear)

14
Document-enhanced Cyberinfrastructure
[Figure. Tools shown: Del.icio.us, CiteULike, Connotea, Bibsonomy, Biolicious, Windows Live Academic Search, Google Scholar, Citeseer, Science.gov, PubChem, PubMed, CMT Conference Management, Manuscript Central, etc. Other labels: Traditional Cyberinfrastructure; Community Tools; Generic Document Tools; Export: RSS, Bibtex, Endnote etc.; Integration/Enhancement User Interface; Existing User Interface; New Document-enhanced Research Tools; Existing Document-based Research Tools]
15
Chemical Informatics as a Grid Application
  • Chemical Informatics is the application of
    information technology to problems in chemistry.
  • Example problems: managing data in large scale
    drug discovery and molecular modeling
  • Building Blocks: Chemical Informatics Resources
  • Chemical databases maintained by various groups
  • NIH PubChem, NIH DTP, http://nihroadmap.nih.gov/
  • Application codes (both commercial and open
    source)
  • Data mining such as clustering
  • Quantum chemistry and molecular modeling
  • Screening centers (with HTS High Throughput
    Screening devices) measuring interaction of
    chemicals with biological samples
  • Visualization tools
  • Web resources: journal articles, etc.
  • Chemical Informatics Grid (http://www.chembiogrid.org)
    needs to integrate these into a common,
    loosely coupled, distributed computing
    environment.

16
Document, Simulation and Data rich CI for
Chemical Informatics
SCIENTIST: "These compounds look promising from their HTS
results. Should I commit some chemistry resources
to following them up?"
17
HTS results and COMPARE Web service: Positive
results (red bar to right of vertical line)
indicate greater than average toxicity of the
tested agent to the cell line.
http://dtp.nci.nih.gov/docs/compare/compare.html
18
HTS data organization and flagging
  • A tumor cell line is selected. The activity
    results for all the compounds in the DTP database
    in the given range are extracted from the
    PostgreSQL database
  • OpenEye FILTER is used to calculate biological
    and chemical properties of the compounds that are
    related to their potential effectiveness as drugs
  • The compounds are clustered on chemical structure
    similarity, to group similar compounds together
  • The compounds along with property and cluster
    information are converted to VOTable format and
    displayed in VOPlot
  • Use Taverna for Workflow and VOTable (from
    astronomy) as basic data structure: VOTable of
    compounds and properties with Excel-like
    spreadsheet services
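
The select, cluster and tabulate steps of this workflow can be sketched as below. Everything here is a simplification under stated assumptions: compounds are plain dictionaries with made-up feature sets as fingerprints, the range filter stands in for the PostgreSQL query, and simple leader clustering on Tanimoto similarity stands in for whatever structure clustering the actual pipeline uses. OpenEye FILTER, Taverna and VOTable output are not modelled.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def select_in_range(compounds, low, high):
    """Keep compounds whose activity lies in [low, high]."""
    return [c for c in compounds if low <= c["activity"] <= high]


def leader_cluster(compounds, threshold=0.5):
    """Assign each compound to the first cluster whose leader is similar enough.

    Returns one table row per compound: id, activity and cluster number.
    """
    leaders = []  # fingerprint of the first member of each cluster
    rows = []
    for c in compounds:
        for cluster_id, leader_fp in enumerate(leaders):
            if tanimoto(c["fingerprint"], leader_fp) >= threshold:
                break
        else:
            # No existing leader is close enough: start a new cluster.
            cluster_id = len(leaders)
            leaders.append(c["fingerprint"])
        rows.append({"id": c["id"], "activity": c["activity"], "cluster": cluster_id})
    return rows
```

The rows returned by `leader_cluster` are the kind of flat compound-plus-properties table that would then be serialized for a spreadsheet-style viewer.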
19
Varuna environment for molecular modeling (Baik,
IU)
[Figure labels: Researcher; Chemical Concepts; Papers etc.; Experiments; ChemBioGrid; Simulation Service (FORTRAN Code, Scripts); DB Service (Queries, Clustering, Curation, etc.); QM Database; ReactionDB; Condor; PubChem, PDB, NCI, etc.; QM/MM Database; Supercomputer]
20
OSCAR3 Service from Cambridge UK
  • Oscar3 is a tool for shallow, chemistry-specific
    natural language parsing of chemical documents
    (i.e. journal articles).
  • It identifies (or attempts to identify)
  • Chemical names: singular nouns, plurals, verbs
    etc., also formulae and acronyms.
  • Chemical data: Spectra, melting/boiling point,
    yield etc. in experimental sections.
  • Other entities: Things like N(5)-C(3) and so on.
  • Uses SMILES, InChI and CML
  • There is a larger effort, SciBorg, in this area
  • http://www.cl.cam.ac.uk/aac10/escience/sciborg.html

http://wwmm.ch.cam.ac.uk/wikis/wwmm/index.php/Oscar3
21
OSCAR2 Chemistry Document analysis
  • It detects magic chemical strings in text and
    then
  • Stores them as metadata associated with document
  • Queries ChemInformatics repositories to tell you
    lots of information about identified compounds
  • Tells you which other documents have this compound

22
Clustering Documents from chemical properties
23
Provenance and Delicious CI
  • We can use del.icio.us style interface to
    annotate Application Data with (extra) provenance
    and user comments of any type (describing quality
    of data or a keyword relating different data
    etc.)
  • All data should be labeled by a URI to enable
    this
  • One has in addition Citeseer/OSCAR metadata
  • Current major tagging systems support flat list
    of tags without name-value (RDF triple) or schema
    organization
  • RDF Triples << Full Semantic Web
  • Delicious << RDF
  • Tradeoff between features and pervasive
    deployment
  • Some extra features are easy to add as a custom
    service
  • Features not supported by del.icio.us can be
    uploaded as comments
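
One way to read "features not supported by del.icio.us can be uploaded as comments" is to carry name-value pairs inside the free-text comment of a flat-tag annotation. The sketch below assumes a made-up `triple:` line prefix and a hypothetical `Annotation` record; it does not use any real del.icio.us API or RDF library.

```python
from dataclasses import dataclass, field
from typing import List

PREFIX = "triple:"  # hypothetical marker for machine-readable comment lines


@dataclass
class Annotation:
    uri: str                       # every annotated datum is labeled by a URI
    tags: List[str] = field(default_factory=list)  # flat tag list
    comment: str = ""              # free text, the only extensible field


def add_triple(annotation, predicate, value):
    """Append one predicate=value pair to the free-text comment."""
    line = f"{PREFIX}{predicate}={value}"
    annotation.comment = "\n".join(filter(None, [annotation.comment, line]))


def triples(annotation):
    """Recover (subject, predicate, object) triples from the comment."""
    out = []
    for line in annotation.comment.splitlines():
        if line.startswith(PREFIX) and "=" in line:
            predicate, value = line[len(PREFIX):].split("=", 1)
            out.append((annotation.uri, predicate, value))
    return out
```

The tagging service still sees only a URI, flat tags and a comment, so pervasive deployment is kept; a custom service can parse the extra lines when it needs the richer structure.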

24
Current Status
  • Google Scholar, Windows Live Academic Search,
    del.icio.us, Connotea, CiteULike, OSCAR3 are Web
    Services
  • Debugging on 500 presentations and papers from my
    CGL research group
  • Experiment with GGF Presentations, Broad
    collection of Chemical Informatics resources
    (explore science document CI link) and
    Concurrency and Computation: Practice and
    Experience Web site (? business model for journals)

25
Collection (Grid) Builder Tool
  • This can perhaps be built on top of workflow
    systems
  • Unlike ordinary workflow, this is a tool to
    manage collections of Grids and the key metadata
    adorning Grids and Services
  • It instantiates needed mediation between Grids
    (systems) to convert
  • JMS to MQSeries
  • GT4 to WS-I
  • WS-Eventing to WS-Notification
  • It supports conventional workflow as tightly
    coupled services
  • It supports system wide management
    (configuration)
  • We are using WS-Management; see CLADE paper
  • Deploy services and mediation brokers on demand
    to deliver real-time performance
  • DoD can't pause the battle while WS-RM and TCP
    catch up if data saturated

26
Grids of Grids of Simple Services
  • Grids are managed collections of one or more
    services
  • A simple service is the smallest Grid
  • Services and Grids are linked by messages
  • Internally to service, functionalities are linked
    by methods
  • Link services via methods → messages → streams
  • We are familiar with method-linked
    hierarchy: Lines of Code → Methods → Objects →
    Programs → Packages

27
Component Grids?
  • So we build collections of Web Services which we
    package as component Grids
  • Visualization Grid
  • Sensor Grid
  • Utility Computing Grid
  • Collaboration Grid
  • Earthquake Simulation Grid
  • Control Room Grid
  • Crisis Management Grid
  • Drug Discovery Grid
  • Bioinformatics Sequence Analysis Grid
  • Intelligence Data-mining Grid
  • We build bigger Grids by composing component Grids
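
The idea that a simple service is the smallest Grid, and that bigger Grids are composed from component Grids, maps naturally onto a composite structure. The sketch below is only an illustration: `Service` and `Grid` are hypothetical classes, and the messages that would link real services are left out.

```python
class Service:
    """A simple service: the smallest Grid."""

    def __init__(self, name):
        self.name = name

    def services(self):
        return [self.name]


class Grid:
    """A managed collection of services and/or component Grids."""

    def __init__(self, name, *members):
        self.name = name
        self.members = list(members)

    def services(self):
        # Flatten the Grid of Grids down to its simple services.
        found = []
        for member in self.members:
            found.extend(member.services())
        return found
```

For example, a Crisis Management Grid could be built as `Grid("Crisis Management Grid", sensor_grid, visualization_grid, Service("alert"))`, and `services()` would list every simple service underneath it.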

28
Mediation and Transformation in a Grid of Grids
and Simple Services
  • Mediation and Transformation Services: Distributed
    Brokers between distributed ports
  • Mediation and Transformation Services: Listen,
    Queue, Transform, Send
  • External facing Interfaces
  • Mediation and Transformation Services: 1-10 ms
    Overhead
  • Use OGSA to Federate?
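
The Listen, Queue, Transform, Send cycle of a mediation service can be sketched as a small broker. This is a single-process illustration under stated assumptions: `MediationBroker` is a hypothetical class, messages are plain dictionaries, and the field mapping in `jms_like_to_mq_like` is invented for the example rather than taken from the real JMS or MQSeries wire formats.

```python
import queue


class MediationBroker:
    """Listen, Queue, Transform, Send between two message conventions."""

    def __init__(self, transform, send):
        self._inbox = queue.Queue()
        self._transform = transform
        self._send = send

    def listen(self, message):
        # Listen + Queue: accept a message from the source side.
        self._inbox.put(message)

    def pump(self):
        """Transform and send everything currently queued; return the count."""
        sent = 0
        while True:
            try:
                message = self._inbox.get_nowait()
            except queue.Empty:
                return sent
            self._send(self._transform(message))
            sent += 1


def jms_like_to_mq_like(message):
    # Illustrative field mapping only; real formats carry many more headers.
    return {"MsgId": message["JMSMessageID"], "Payload": message["body"]}
```

Swapping the `transform` function is all it takes to mediate a different pair of conventions, which is what makes on-demand deployment of such brokers plausible.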
29
4 Cores is 3000 messages per second: about one
message per millisecond per core for Opteron; one
message per 2 ms for Sun Niagara core
30
Pentium 4 (3.4GHz) with 1GB of RAM; IBM MQ
Series, NaradaBrokering and the Message Bridge
are all running on it. NaradaBrokering running
in JMS emulation mode
Message Size   NaradaBrokering (JMS) to IBM MQ     IBM MQ to NaradaBrokering (JMS)
               In-order      No ordering           In-order      No ordering
               (msgs/sec)    (msgs/sec)            (msgs/sec)    (msgs/sec)
100 Bytes      350           530                   320           310
1 Kbytes       330           500                   290           290
4 Kbytes       200           390                   220           210
31
Raw Data → Data → Information →
Knowledge → Wisdom
[Figure: diagram of services linked by SOAP Messages. Legend: SS = Sensor Service, FS = Filter Service, OS = Other Service, MD = MetaData. Other labels: Portal, Decisions, Another Grid, Another Service]