Title: Summary of SDM ETC Kickoff for the Data Integration Task
1 Summary of SDM ETC Kickoff for the Data Integration Task
Calton Pu, Ling Liu, David Buttler, Bertram Ludaescher, Amarnath Gupta, Mladen Vouk, Tom Potok
2 People involved
- People
- Terence Critchlow (LLNL)
- Calton Pu (GT)
- Ling Liu (GT)
- David Buttler (GT)
- Bertram Ludaescher (UCSD)
- Amarnath Gupta (UCSD)
- TBD
- Ph.D. student at Georgia Tech
- Developer at UCSD
- Mladen Vouk (NCSU) / Tom Potok (ORNL)
- Commitment per institution
- LLNL
- 0.25 (likely) to 1.0 FTE
- Georgia Tech
- 2 Ph.D. Students
- X months of Calton's time
- Y months of Ling's time
- UCSD
- 1 FTE
- 1 month of Bertram's time
- 1 month of Gupta's time
- Agent team
- 2-4 months over the course of the year
3 Application ties
- Primary domain: bioinformatics
- Secondary domains
- Material science
- Air / water quality
- Scientists (early adopters)
- Matt Coleman (LLNL), contacted by Terence
- Allen Christian (LLNL), contacted by Terence
- Phil Bourne (PDB), contacted by Bertram / Gupta
4 Use Case 1: Finding out everything about a sequence
- Bob starts with one or several DNA or protein sequences that he wants to analyze
- OR Bob finds protein or gene sequences of interest by querying databases/web sites for metabolic pathways/cell signaling pathways (e.g., KEGG)
- OR Bob looks at a database of microarray experiments and chooses those genes that exhibit specified patterns of co-occurrence (what subsets of genes go hand in hand across a large number of experiments)
- The relevant sequences are submitted to one or more sequence databases for BLAST search
- The homologous sequences found in the searched database(s) are
- directly returned to the user, sorted by score
- OR post-processed by the mediator (duplicate elimination, groupings, links to additional contextual data)
- The resulting sequences can be queried for their associated information
- Bob can use these sequences for new similarity searches
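The post-processing path in this use case (duplicate elimination, then ordering by score) can be sketched as follows. This is only an illustration, not the mediator's design: the `Hit` record, its field names, the sample accessions, and the rule "keep the best-scoring copy of each accession" are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One homologous sequence returned by a source (hypothetical record)."""
    accession: str  # identifier of the matched sequence
    source: str     # which database reported it
    score: float    # similarity score; higher is better


def merge_hits(per_source_hits):
    """Merge hit lists from several sources.

    Hits sharing an accession are collapsed to the best-scoring copy,
    and the survivors are sorted by descending score.
    """
    best = {}
    for hits in per_source_hits:
        for hit in hits:
            kept = best.get(hit.accession)
            if kept is None or hit.score > kept.score:
                best[hit.accession] = hit
    return sorted(best.values(), key=lambda h: h.score, reverse=True)


# Made-up results from two sources; P12345 is reported by both.
source_a = [Hit("P12345", "source_a", 210.0), Hit("Q99999", "source_a", 95.5)]
source_b = [Hit("P12345", "source_b", 198.0), Hit("O00001", "source_b", 150.0)]
merged = merge_hits([source_a, source_b])
print([(h.accession, h.source, h.score) for h in merged])
```

Grouping and links to contextual data would need additional fields on the record; they are left out to keep the sketch short.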
5 Use Case 1: Additional scenarios
- Helpful features for users
- Multiple sequences entered through a single file
- Ability to tie in other programs to preprocess data before passing it to wrappers / mediator
- Follow-up searches may be more than just BLAST searches
- Selection / project / join queries through the interface
- Tie in other tools such as RasMol
- Other types of search such as PHI-BLAST, PSI-BLAST, or other structural similarity searches
6 Data Integration Architecture
[Figure: architecture diagram. Labeled components: Query Dispatch and Collection (QDaC); Integration component / KB-Mediator (KBM); XML Wrappers; CM Wrappers; XWRAP Wrapper Generator; Source / Agent MetaData Registry; VIPAR; API. Sources shown: Medline, PDB. Interface labels: "XQuery (subsets e.g. Sel/Proj)" and "XQuery interface: Select/project only". Annotation: "if invoked, pre-processes query parameters and post-processes results".]
7 Architecture comments
- Communication protocol
- Use agent technology to communicate between components
- Don't use full capabilities when on the same machine
- Between QDaC and wrappers, QDaC and mediator, mediator and CMs, CMs and wrappers
- NOT expected between wrappers and source
- Embedded representation
- XML sources are queried using a subset of XQuery (fragments)
- Primarily concerned with selection and projection, not join
- Query results are returned in XML
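To make "selection and projection, results returned in XML" concrete, here is a small sketch using Python's standard `xml.etree.ElementTree`. It does not run XQuery; a plain function stands in for the XQuery subset. The element names (`hits`, `hit`, `accession`, `score`, `organism`) and the sample data are invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Made-up XML as a wrapper might return it.
SOURCE_XML = """
<hits>
  <hit><accession>P12345</accession><score>210.0</score><organism>human</organism></hit>
  <hit><accession>Q99999</accession><score>95.5</score><organism>mouse</organism></hit>
  <hit><accession>O00001</accession><score>150.0</score><organism>human</organism></hit>
</hits>
"""


def select_project(xml_text, record_tag, predicate, fields):
    """Return an XML string holding the selected records and projected fields.

    record_tag: tag of the repeating record element
    predicate:  function(record element) -> bool  (selection)
    fields:     child tags to keep in each record (projection)
    """
    root = ET.fromstring(xml_text)
    result = ET.Element("result")
    for record in root.iter(record_tag):
        if not predicate(record):
            continue
        out = ET.SubElement(result, record_tag)
        for field in fields:
            child = record.find(field)
            if child is not None:
                ET.SubElement(out, field).text = child.text
    return ET.tostring(result, encoding="unicode")


# Select the human hits, project onto accession and score.
answer = select_project(
    SOURCE_XML,
    "hit",
    lambda record: record.findtext("organism") == "human",
    ["accession", "score"],
)
print(answer)
```

Because no join is involved, each record can be filtered and trimmed independently, which is what keeps a wrapper-side implementation this simple.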
8 Architecture comments
- Meta-data repository (metadata server)
- Contains
- Location, schema
- Query capabilities (BLAST, keyword, XPath) of sources
- May be duplicated / shared between QDaC and KBM
- Eventually may be treated as an agent
- External programs
- Will be included as preprocessing steps
- May need wrappers to handle translations properly
- Will be tied in to interface where possible
- Gives users access to tools they need / want / are familiar with
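One possible shape for a meta-data repository entry (location, schema, query capabilities) and a capability lookup is sketched below. The class names, the placeholder locations and schema names, and the capabilities assigned to each source are all hypothetical; the slides name only the kinds of information stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceEntry:
    """Meta-data kept for one source (hypothetical layout)."""
    name: str
    location: str            # where the source's wrapper can be reached
    schema: str              # identifier of the source's XML schema
    capabilities: frozenset  # subset of {"blast", "keyword", "xpath"}


class MetadataRegistry:
    """In-memory registry; a real one might be a shared server or an agent."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def lookup(self, name):
        return self._entries[name]

    def sources_supporting(self, capability):
        """Names of all sources that advertise the given query capability."""
        return sorted(
            entry.name
            for entry in self._entries.values()
            if capability in entry.capabilities
        )


registry = MetadataRegistry()
# Placeholder locations, schemas, and capability sets (assumptions).
registry.register(SourceEntry("NCBI", "wrapper.example/ncbi", "ncbi-hits.xsd",
                              frozenset({"blast", "keyword"})))
registry.register(SourceEntry("PDB", "wrapper.example/pdb", "pdb-entry.xsd",
                              frozenset({"keyword", "xpath"})))
registry.register(SourceEntry("Medline", "wrapper.example/medline", "medline.xsd",
                              frozenset({"keyword"})))

print(registry.sources_supporting("blast"))
print(registry.sources_supporting("keyword"))
```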
9 Architecture comments
- Expect most wrappers to be generated by XWrap in practice, but it shouldn't matter as long as they follow the specified protocol and representation
- VIPAR used to wrap publication sources
- Simple SQL wrapper for direct database access
- Definitions
- CM: conceptual mapping; a wrapper that translates source-specific XML into
10 Year 1 deliverables
- Send XQuery command to BLAST sources, combine results, and return to user interface
- Interact with at least 4 sources
- Integration component will have at least 2 sources
- QDaC will directly query NCBI and at least one other
- Operate QDaC and mediator in a distributed environment
- Interface / QDaC at LLNL and mediator at UCSD
Have agent stubs at UCSD and LLNL passing text strings within 3 months
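As a rough picture of what "stubs passing text strings" could look like, the sketch below frames each message as one newline-terminated UTF-8 line over a socket. The framing rule and the function names `send_text` / `recv_text` are assumptions, not the agent protocol the task adopted; a connected socket pair inside one process stands in for the two sites.

```python
import socket


def send_text(sock, text):
    """Send one text message, terminated by a newline."""
    if "\n" in text:
        raise ValueError("message must not contain a newline")
    sock.sendall(text.encode("utf-8") + b"\n")


def recv_text(sock):
    """Receive one newline-terminated text message."""
    buffer = bytearray()
    while not buffer.endswith(b"\n"):
        chunk = sock.recv(1)  # byte at a time: simple, not fast
        if not chunk:
            raise ConnectionError("peer closed before end of message")
        buffer.extend(chunk)
    return buffer[:-1].decode("utf-8")


# Two connected endpoints in one process stand in for the two sites.
stub_a, stub_b = socket.socketpair()
send_text(stub_a, "ping from stub A")
received = recv_text(stub_b)
send_text(stub_b, "ack: " + received)
reply = recv_text(stub_a)
stub_a.close()
stub_b.close()

print(received)
print(reply)
```

Between real sites the same two functions would work over a TCP connection created with `socket.create_connection` on one side and a listening socket on the other.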
11 Detailed tasks
- Interface (LLNL)
- Extended to handle blast against new sources
- Some of which are not integrated
- QDaC (LLNL)
- Identify available wrappers from meta-data
- This includes the SDSC component
- Query wrappers using XQuery
- Collect and sort responses
- Adopt agent protocol
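The QDaC steps listed above (query each available wrapper, collect the responses, sort them) might be sketched like this. Wrappers are modeled as plain callables returning `(hit_id, score)` pairs, and a failing wrapper is skipped rather than aborting the whole query; both choices, and all names and sample values, are assumptions made for the example.

```python
def dispatch_and_collect(query, wrappers):
    """Send the query to every wrapper, collect rows, sort by score.

    wrappers: dict mapping source name -> callable(query) -> list of
              (hit_id, score) pairs.
    Returns (rows, failed), where rows are (source, hit_id, score) tuples
    in descending score order and failed lists sources that raised.
    """
    rows, failed = [], []
    for name, wrapper in wrappers.items():
        try:
            answers = wrapper(query)
        except Exception:
            failed.append(name)  # keep going with the remaining sources
            continue
        rows.extend((name, hit_id, score) for hit_id, score in answers)
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows, failed


# Stub wrappers with made-up results, standing in for real sources.
def ncbi_stub(query):
    return [("hit-1", 210.0), ("hit-2", 95.5)]


def other_stub(query):
    return [("hit-3", 150.0)]


def broken_stub(query):
    raise TimeoutError("source did not answer")


results, failed = dispatch_and_collect(
    "ACGT", {"ncbi": ncbi_stub, "other": other_stub, "broken": broken_stub}
)
print(results)
print(failed)
```

In the architecture described here, the dictionary of wrappers would be built from the meta-data lookup rather than hard-coded.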
12 Detailed tasks
- XWrap (GT)
- Accept XPath/XQuery input
- Handle complex BLAST interfaces
- Adopt agent protocol
- Mediator (UCSD)
- Model of pathways, gene and protein expressions → ontology to be used for driving BLAST queries and interpreting their results
- Accept XQuery queries
- Identify available sources from meta-data
- Modify CM wrappers to generate XQuery commands
- Agent technology (ORNL, LLNL, UCSD)
- Use VIPAR to wrap Medline database
- Use protocols to communicate between LLNL and
SDSC components
13 Administrative
- Reports
- Quarterly reports
- to be collected by Terence, (possibly) summarized, and forwarded on to Arie
- Short bulleted form (Word file or plain text preferred)
- Center-wide communications
- Telecon 1st Monday of the month, 11:00-12:00 PST
- It is OK to miss this
- Semi-annual meetings
- next at ORNL in mid-March
- Center web site will point to individual task sites
- Shared CVS repository at NC State
- Primarily for major releases / sharing code between tasks
14 Administrative
- Advisory committee
- Potential names from bioinformatics area
- Carole Goble (Univ of Manchester), Tom Slezak (LLNL), ???
- Unclear who pays travel for members
- This is for us, so they will not be generating reports
15 Task specific
- Mail list
- For our task ONLY: sdmctr-integrate_at_llnl.gov is being set up
- Will be archived
- Site contacts
- Terence (LLNL)
- Bertram (UCSD)
- Calton (GT)
- Tom (Agents)
- Web site
- Being set up at GT
- Use main CVS repository for major releases
- Code sharing option 1
- Task-only CVS repository for day-to-day work
- Unlikely LLNL could host this service
- Code sharing option 2
- Site-specific CVS repositories for day-to-day work
- Alexandria repository for inter-task code sharing
- https://www-casc.llnl.gov/alexandria/
- Disadvantage: tar-balls
- Advantage: we don't all need an account on the repository machine