Summary of SDM ETC Kickoff for the Data Integration Task - PowerPoint PPT Presentation

Slides: 16
Provided by: ling151
Learn more at: https://sdm.lbl.gov

Transcript and Presenter's Notes


1
Summary of SDM ETC Kickoff for the Data Integration Task
  • Terence Critchlow

Calton Pu, Ling Liu, David Buttler, Bertram Ludaescher,
Amarnath Gupta, Mladen Vouk, Tom Potok
2
People involved
  • People
  • Terence Critchlow (LLNL)
  • Calton Pu (GT)
  • Ling Liu (GT)
  • David Buttler (GT)
  • Bertram Ludaescher (UCSD)
  • Amarnath Gupta (UCSD)
  • TBD
  • Ph.D. student at Georgia Tech
  • Developer at UCSD
  • Mladen Vouk (NCSU) / Tom Potok (ORNL)
  • Commitment per institution
  • LLNL
  • 0.25 (likely) to 1.0 FTE
  • Georgia Tech
  • 2 Ph.D. Students
  • X months of Calton's time
  • Y months of Ling's time
  • UCSD
  • 1 FTE
  • 1 month of Bertram's time
  • 1 month of Gupta's time
  • Agent team
  • 2-4 months over the course of the year

3
Application ties
  • Primary domain: bioinformatics
  • Secondary domains
  • Material science
  • Air / water quality
  • Scientists (early adopters)
  • Matt Coleman (LLNL), contacted by Terence
  • Allen Christian (LLNL), contacted by Terence
  • Phil Bourne (PDB), contacted by Bertram / Gupta

4
Use Case 1: Finding out everything about a sequence
  • Bob starts with one or several DNA or protein
    sequences that he wants to analyze
  • OR Bob finds protein or gene sequences of
    interest by querying databases/web sites for
    metabolic pathways/cell signaling pathways (e.g.,
    KEGG)
  • OR Bob looks at a database of microarray
    experiments and chooses those genes that exhibit
    specified patterns of co-occurrence (what subsets
    of genes go hand in hand across a large number
    of experiments)
  • The relevant sequences are submitted to one or
    more sequence databases for a BLAST search
  • The homologous sequences found in the searched
    database(s) are
  • directly returned to the user, sorted by score
  • OR post-processed by the mediator (duplicate
    elimination, groupings, links to additional
    contextual data)
  • The resulting sequences can be queried for their
    associated information
  • Bob can use these sequences for new similarity
    searches
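
The mediator post-processing step above (duplicate elimination, sorting by score, grouping) can be sketched as follows. This is a minimal illustration only: the `Hit` record, its field names, and the source names are assumptions made for the example, not part of the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One homologous sequence returned by a source (hypothetical fields)."""
    accession: str  # identifier of the matched sequence
    source: str     # database that returned the hit
    score: float    # similarity score; higher is better


def post_process(hits):
    """Drop duplicate accessions (keeping the best score), then sort by score."""
    best = {}
    for hit in hits:
        kept = best.get(hit.accession)
        if kept is None or hit.score > kept.score:
            best[hit.accession] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


def group_by_source(hits):
    """Group hits by the database they came from."""
    groups = {}
    for hit in hits:
        groups.setdefault(hit.source, []).append(hit)
    return groups


hits = [
    Hit("P12345", "source_a", 210.0),
    Hit("Q99999", "source_b", 180.5),
    Hit("P12345", "source_b", 198.0),  # same accession reported by a second source
]
merged = post_process(hits)
print([(h.accession, h.score) for h in merged])
```

Keeping the best-scoring copy of a duplicate is one possible policy; a real mediator might instead merge the annotations from both sources.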

5
Use Case 1: Additional scenarios
  • Helpful features for users
  • Multiple sequences entered through a single file
  • Ability to tie in other programs to preprocess
    data before passing it to wrappers / mediator
  • Follow-up searches may be more than just BLAST searches
  • Selection / project / join queries through the
    interface
  • Tie in other tools such as RasMol
  • Other types of search such as PHI-BLAST, PSI-BLAST,
    or other structural similarity searches

6
Data Integration Architecture
[Architecture diagram; labels recovered from the figure]
  • Query Dispatch and Collection (QDaC)
  • Integration component / KB-Mediator (KBM)
  • Source / Agent MetaData Registry
  • XWRAP Wrapper Generator
  • XML Wrapper (several instances)
  • CM Wrapper (several instances)
  • Sources and tools: Medline, VIPAR, PDB
  • API
  • XQuery (subsets, e.g. Sel/Proj); XQuery interface, select/project only
  • Annotation: if invoked, pre-processes query parameters and
    post-processes results
7
Architecture comments
  • Communication protocol
  • Use agent technology to communicate between
    components
  • Don't use the full capabilities when components are on
    the same machine
  • Between QDaC and wrappers, QDaC and mediator,
    mediator and CMs, CMs and wrappers
  • NOT expected between wrappers and source
  • Embedded representation
  • XML sources are queried using a subset of XQuery
    (fragments)
  • Primarily concerned with selection and projection,
    not join
  • Query results are returned in XML
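
To make "selection and projection" over XML results concrete, here is a small sketch. It uses Python's standard `xml.etree.ElementTree` as a stand-in, not an XQuery engine, and the element names (`results`, `hit`, `accession`, `score`, `organism`) are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper output; the element names are assumptions.
RESULT_XML = """
<results>
  <hit><accession>P12345</accession><score>210.0</score><organism>human</organism></hit>
  <hit><accession>Q99999</accession><score>95.5</score><organism>mouse</organism></hit>
</results>
""".strip()


def select_project(xml_text, min_score, fields):
    """Keep <hit> elements with score >= min_score (selection) and copy only
    the requested child elements (projection). Returns an XML string."""
    root = ET.fromstring(xml_text)
    out = ET.Element("results")
    for hit in root.findall("hit"):
        if float(hit.findtext("score")) < min_score:
            continue  # selection: drop low-scoring hits
        kept = ET.SubElement(out, "hit")
        for name in fields:
            child = hit.find(name)
            if child is not None:
                ET.SubElement(kept, name).text = child.text  # projection
    return ET.tostring(out, encoding="unicode")


projected = select_project(RESULT_XML, 100.0, ["accession", "score"])
print(projected)
```

The result is itself XML, matching the point that query results are returned in XML; joins across sources are deliberately left out, as on the slide.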

8
Architecture comments
  • Meta-data repository (metadata server)
  • Contains
  • Location, schema
  • Query capabilities (blast, keyword, XPath) of
    sources
  • May be duplicated / shared between QDaC and KBM
  • Eventually may be treated as an agent
  • External programs
  • Will be included as preprocessing steps
  • May need wrappers to handle translations properly
  • Will be tied in to interface where possible
  • Gives users access to tools they need / want /
    are familiar with
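
A metadata repository of the kind described above could be sketched as a small in-memory registry that records location, schema, and query capabilities per source. The class names, the `wrapper://` locations, and the source names are all hypothetical; the presentation does not specify a storage format.

```python
from dataclasses import dataclass, field


@dataclass
class SourceEntry:
    """Metadata for one source (all values here are placeholders)."""
    name: str
    location: str   # where the wrapper for this source can be reached
    schema: list    # element names the source returns
    capabilities: set = field(default_factory=set)  # e.g. {"blast", "keyword"}


class MetadataRegistry:
    """In-memory registry; QDaC or KBM could each hold a copy."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def find(self, capability):
        """Return the names of sources that support the given capability."""
        return sorted(
            e.name for e in self._entries.values() if capability in e.capabilities
        )


registry = MetadataRegistry()
registry.register(SourceEntry("source_a", "wrapper://host-a",
                              ["accession", "score"], {"blast", "keyword"}))
registry.register(SourceEntry("source_b", "wrapper://host-b",
                              ["title", "abstract"], {"keyword"}))
registry.register(SourceEntry("source_c", "wrapper://host-c",
                              ["structure"], {"xpath"}))

print(registry.find("blast"))
print(registry.find("keyword"))
```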

9
Architecture comments
  • Expect most wrappers to be generated by XWrap in
    practice, but it shouldn't matter as long as they
    follow the specified protocol and representation
  • VIPAR used to wrap publication sources
  • Simple SQL wrapper for direct database access
  • Definitions
  • CM (conceptual mapping): a wrapper that
    translates source-specific XML into ...

10
Year 1 deliverables
  • Send XQuery command to BLAST sources, combine
    results, and return to user interface
  • Interact with at least 4 sources
  • Integration component will have at least 2
    sources
  • QDaC will directly query NCBI and at least one
    other
  • Operate QDaC and mediator in a distributed
    environment
  • Interface / QDaC at LLNL and mediator at UCSD

  • Have agent stubs at UCSD and LLNL passing text
    strings within 3 months
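
As a rough picture of "stubs passing text strings", the sketch below exchanges newline-terminated UTF-8 messages between two endpoints. It uses a local `socket.socketpair()` purely as a stand-in: the actual agent framework and wire protocol are not described in the presentation, and the message contents are invented.

```python
import socket


def send_text(sock, text):
    """Send one newline-terminated UTF-8 message."""
    sock.sendall(text.encode("utf-8") + b"\n")


def recv_text(sock):
    """Read bytes until a newline arrives and return the decoded message."""
    buf = bytearray()
    while not buf.endswith(b"\n"):
        chunk = sock.recv(1024)
        if not chunk:  # peer closed the connection
            break
        buf.extend(chunk)
    return buf.rstrip(b"\n").decode("utf-8")


# Two connected endpoints standing in for the stubs at the two sites.
site_a, site_b = socket.socketpair()
try:
    send_text(site_a, "hello from the interface site")
    request = recv_text(site_b)
    send_text(site_b, "ack: " + request)
    reply = recv_text(site_a)
finally:
    site_a.close()
    site_b.close()

print(reply)
```

Between real sites the endpoints would be separate processes on different hosts; the framing (one message per line) is only one simple choice.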
11
Detailed tasks
  • Interface (LLNL)
  • Extended to handle BLAST searches against new sources
  • Some of which are not integrated
  • QDaC (LLNL)
  • Identify available wrappers from meta-data
  • This includes the SDSC component
  • Query wrappers using XQuery
  • Collect and sort responses
  • Adopt agent protocol
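
The QDaC steps listed above (query the available wrappers, then collect and sort the responses) might look like the sketch below. The wrappers are plain Python callables returning dictionaries, which is an assumption made for illustration; in the described design they would be reached through XQuery and the agent protocol.

```python
from concurrent.futures import ThreadPoolExecutor


# Stub wrappers standing in for remote sources (hypothetical data).
def wrapper_a(query):
    return [{"accession": "A1", "score": 150.0, "source": "a"}]


def wrapper_b(query):
    return [
        {"accession": "B7", "score": 300.0, "source": "b"},
        {"accession": "B2", "score": 90.0, "source": "b"},
    ]


def wrapper_down(query):
    raise ConnectionError("source unavailable")


def dispatch_and_collect(query, wrappers):
    """Send the query to every wrapper in parallel, collect the responses,
    and return (results sorted by descending score, names of failed wrappers)."""
    collected, failed = [], []
    with ThreadPoolExecutor(max_workers=max(1, len(wrappers))) as pool:
        futures = {name: pool.submit(fn, query) for name, fn in wrappers.items()}
        for name, future in futures.items():
            try:
                collected.extend(future.result(timeout=30))
            except Exception:
                failed.append(name)  # one bad source should not sink the query
    collected.sort(key=lambda r: r["score"], reverse=True)
    return collected, failed


results, failed = dispatch_and_collect(
    "ACGTACGT", {"a": wrapper_a, "b": wrapper_b, "down": wrapper_down}
)
print([r["accession"] for r in results], failed)
```

In a fuller version the `wrappers` mapping would be built from the metadata lookup rather than passed in by hand.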

12
Detailed tasks
  • XWrap (GT)
  • Accept XPath/XQuery input
  • Handle complex BLAST interfaces
  • Adopt agent protocol
  • Mediator (UCSD)
  • Model of pathways, gene and protein expressions
    -> ontology to be used for driving BLAST queries
    and interpreting their results
  • Accept XQuery queries
  • Identify available sources from meta-data
  • Modify CM wrappers to generate XQuery commands
  • Agent technology (ORNL, LLNL, UCSD)
  • Use VIPAR to wrap Medline database
  • Use protocols to communicate between LLNL and
    SDSC components

13
Administrative
  • Reports
  • Quarterly reports
  • to be collected by Terence, (possibly)
    summarized, and forwarded on to Arie
  • Short bulleted form (Word file or plain text
    preferred)
  • Center-wide communications
  • Telecon: 1st Monday of the month, 11:00-12:00 PST
  • It is ok to miss this
  • Semi-annual meetings
  • next at ORNL in mid-March
  • Center web site will point to individual task
    sites
  • Shared CVS repository at NC State
  • Primarily for major releases / sharing code
    between tasks

14
Administrative
  • Advisory committee
  • Potential names from bioinformatics area
  • Carole Goble (Univ of Manchester), Tom Slezak
    (LLNL), ???
  • Unclear who pays travel for members
  • This is for us, so they will not be generating
    reports

15
Task specific
  • Mail list
  • For our task ONLY: sdmctr-integrate_at_llnl.gov is
    being set up
  • Will be archived
  • Site contacts
  • Terence (LLNL)
  • Bertram (UCSD)
  • Calton (GT)
  • Tom (Agents)
  • Web site
  • Being set up at GT
  • Use main CVS repository for major releases
  • Code sharing option 1
  • Task-only CVS repository for day-to-day work
  • Unlikely LLNL could host this service
  • Code sharing option 2
  • Site-specific CVS repositories for day-to-day
    work
  • Alexandria repository for inter-task code sharing
  • https://www-casc.llnl.gov/alexandria/
  • Disadvantage: tarballs
  • Advantage: we don't all need an account on the
    repository machine