APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery - PowerPoint PPT Presentation

Transcript and Presenter's Notes

Title: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery


1
APAN e-Science Workshop: e-Bio System for
Bio-Knowledge Discovery
  • 2003.8.27
  • Sangsoo Kim
  • National Genome Information Center
  • Korea Research Institute of Bioscience and Biotechnology

2
Bio-Databases and Servers
  • Contents
      • Bibliographic (journal abstracts, such as Medline)
      • Experimental data (sequences or structures)
      • Results from annotation and analyses
      • Bioinformatic analysis tools
  • Purpose
      • Storing and managing raw data
      • Querying for knowledge discovery
      • Sharing information with others
      • Serving others with online analysis

3
New Role of Databases
  • New discoveries of biological knowledge are
    published in scientific journals
  • But journal space is limited and not suited to
    publishing large amounts of high-throughput data
  • The supplementary information is provided on an
    accompanying website
  • Readers can download the supplementary
    information and analyze it from a different angle
  • Combining it with other information may yield
    unexpected results
  • Journal publishers require that supplementary
    information be deposited in public archives

4
Example - Nucleotide Sequence Repositories
  • Nucleotide sequences discovered by sequencing
    experiments are deposited in one of the public
    archives, and the journal paper lists only the
    accession numbers (without deposition, you
    cannot publish a sequence discovery in journals)
  • The public archives are:
      • DDBJ, operated by CIB, NIG in Japan
      • EMBL, operated by EMBL-EBI in the UK
      • GenBank, operated by NCBI, NIH in the USA
  • The contents of these archives are exchanged
    daily and are freely accessible to everybody
  • Now extended to archive DNA chip data as well

5
Growth of GenBank: A Nucleotide Sequence Repository
6
Entrez Home Page
RTFM
7
Entrez Display
GenBank as HTML
FASTA as HTML
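The FASTA display named above is a simple text format: a header line starting with ">" followed by one or more sequence lines. A minimal sketch of reading it, assuming well-formed input; the function name parse_fasta and the sample records are illustrative and not part of the talk.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    current_id = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first whitespace-delimited token of the header.
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else f"unnamed_{len(records)}"
            records[current_id] = ""
        elif current_id is not None:
            # Sequence lines are concatenated and normalized to upper case.
            records[current_id] += line.upper()
    return records


# Made-up sample records, for illustration only.
sample = """>demo1 hypothetical record
acgtacgt
ttga
>demo2
GGCC
"""
parsed = parse_fasta(sample.splitlines())
print(parsed)
```

Real records also carry a free-text description after the identifier, which this sketch discards.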
8
Example - BLAST Servers
  • Originally developed to compare a user's sequence
    with those in the repository, to check whether
    it is novel
  • Extended to detect distantly related sequences,
    serving as the major sequence annotation tool
  • Servers accept various kinds of queries and
    return alignment results over the WWW
  • The most widely used bioinformatics tool
  • For analyzing many sequences, a local
    installation is preferable
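
For the local installation mentioned in the last point, a batch run usually comes down to one command per query file. The sketch below only assembles such a command. It assumes the NCBI BLAST+ command-line tools (which postdate this talk) are installed and that a nucleotide database has already been built; the file and database names are hypothetical.

```python
import shlex
from typing import List


def build_blastn_command(query_fasta: str, database: str, out_path: str,
                         evalue: float = 1e-5) -> List[str]:
    """Assemble a BLAST+ blastn invocation as an argument list."""
    return [
        "blastn",
        "-query", query_fasta,    # FASTA file holding all query sequences
        "-db", database,          # name of a locally formatted database
        "-evalue", str(evalue),   # expectation-value cutoff
        "-outfmt", "6",           # tabular output, one hit per line
        "-out", out_path,
    ]


# Hypothetical file and database names.
cmd = build_blastn_command("queries.fasta", "mydb", "hits.tsv")
print(shlex.join(cmd))
```

The list can be handed to subprocess.run(cmd, check=True) on a machine where BLAST+ is actually installed.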

9
BLAST (Basic Local Alignment Search Tool)
http://www.ncbi.nlm.nih.gov/BLAST
program   query      database
blastn    DNA        DNA
blastp    protein    protein
blastx    DNA (6x)   protein
tblastn   protein    DNA (6x)
tblastx   DNA (6x)   DNA (6x)
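
The program table above amounts to a lookup from the query and database sequence types to a program name, where "(6x)" marks the side that is translated in all six reading frames. A small sketch of that lookup; the function name choose_program and the translate_both flag are illustrative choices, not BLAST options.

```python
from typing import Dict, Tuple

# (query type, database type) -> program, following the slide's table.
BLAST_PROGRAMS: Dict[Tuple[str, str], str] = {
    ("dna", "dna"): "blastn",
    ("protein", "protein"): "blastp",
    ("dna", "protein"): "blastx",    # query translated in six frames
    ("protein", "dna"): "tblastn",   # database translated in six frames
}


def choose_program(query_type: str, db_type: str,
                   translate_both: bool = False) -> str:
    """Pick the BLAST program for the given sequence types."""
    if translate_both:
        # tblastx compares six-frame translations of both sides.
        if (query_type, db_type) != ("dna", "dna"):
            raise ValueError("tblastx needs a DNA query and a DNA database")
        return "tblastx"
    return BLAST_PROGRAMS[(query_type, db_type)]


print(choose_program("dna", "protein"))
```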
RTFM
10
BLASTN (Cont'd)
Descriptions
Alignments
11
Example - Derived Databases
  • Swiss-Prot, PIR
      • Proteins are predicted from deposited nucleotide
        sequences, either mRNA or genomic DNA
      • Functions and features of the proteins are
        annotated manually by experts
  • Protein motifs
      • PROSITE, Pfam, BLOCKS, InterPro
      • Keyword querying and motif detection in a user's
        sequence
  • Gene Ontology
      • Hierarchical organization of biological terms
      • Cataloging associated gene products

12
ExPASy (http://www.expasy.ch)
Expert Protein Analysis System
13
NiceProt View
14
Gene Ontology
  • Systematic classification of biological
    terminology
      • Molecular function
      • Biological process
      • Cellular component
  • Controlled vocabulary
  • Associated gene list
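
The points above (a hierarchy of controlled terms, each with an associated gene list) can be sketched as a small graph walk. This is a toy, heavily simplified slice of the molecular function branch: the parent links skip intermediate terms of the real ontology, and the gene names are hypothetical.

```python
from typing import Dict, List, Set

# child term -> parent terms (in the real GO a term may have several parents)
PARENTS: Dict[str, List[str]] = {
    "catalytic activity": ["molecular_function"],
    "kinase activity": ["catalytic activity"],
    "protein kinase activity": ["kinase activity"],
    "binding": ["molecular_function"],
}

# term -> gene products annotated directly to it (hypothetical names)
ANNOTATIONS: Dict[str, Set[str]] = {
    "protein kinase activity": {"geneA"},
    "kinase activity": {"geneB"},
    "binding": {"geneC"},
}


def ancestors(term: str) -> Set[str]:
    """All terms reachable by following parent links upward."""
    found: Set[str] = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS.get(parent, []))
    return found


def genes_for(term: str) -> Set[str]:
    """Genes annotated to `term` itself or to any more specific term."""
    return {
        gene
        for annotated_term, genes in ANNOTATIONS.items()
        if annotated_term == term or term in ancestors(annotated_term)
        for gene in genes
    }


print(sorted(genes_for("kinase activity")))
```

Querying a general term returns the genes of every more specific term beneath it, which is what makes the hierarchy useful for cataloging gene products.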

15
(No Transcript)
16
Data Mining
  • Objective
      • Discovery of (biological) knowledge by querying
        information in the databases and comprehending it
  • Problems
      • Too many databases
      • Different protocols for access
      • Lack of standards
      • Poor quality or propagation of errors
  • Solutions
      • Data warehousing or federated databases

17
Catalog of Bio-DBs arranged by Data Domain
18
Database of Databases
  • Data warehousing
      • Collect all databases by mirroring
      • Store in a unified format
      • Entrez (NCBI) or SRS (EBI)
      • Powerful but heavy maintenance load
  • Federated databases
      • Maintained by participating members
      • Accessed by common protocols
      • Bio-DAS or Web Services via SOAP/XML
      • Next-generation technology, but dependent on both
        the cooperation of members and Internet bandwidth
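
The contrast between the two approaches can be sketched in a few lines: a warehouse copies records into one local store ahead of time, while a federation asks each member at query time through a shared interface. Everything here (the Source type, the archive names, the accessions) is a hypothetical stand-in for real remote databases.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, str]
Source = Callable[[str], Optional[Record]]  # accession -> record, or None


def make_source(name: str, data: Dict[str, str]) -> Source:
    """Stand-in for a member database reachable through a common protocol."""
    def lookup(accession: str) -> Optional[Record]:
        if accession in data:
            return {"accession": accession, "source": name,
                    "description": data[accession]}
        return None
    return lookup


def build_warehouse(sources: List[Source],
                    accessions: List[str]) -> Dict[str, Record]:
    """Warehousing: mirror records up front into one unified local store."""
    warehouse: Dict[str, Record] = {}
    for accession in accessions:
        for source in sources:
            record = source(accession)
            if record is not None:
                warehouse[accession] = record
                break
    return warehouse


def federated_lookup(sources: List[Source],
                     accession: str) -> Optional[Record]:
    """Federation: ask each member at query time; nothing is copied."""
    for source in sources:
        record = source(accession)
        if record is not None:
            return record
    return None


sources = [
    make_source("archiveA", {"ACC1": "first record"}),
    make_source("archiveB", {"ACC2": "second record"}),
]
warehouse = build_warehouse(sources, ["ACC1", "ACC2"])
print(warehouse["ACC2"] == federated_lookup(sources, "ACC2"))
```

The warehouse answers from local data but must be rebuilt whenever a member changes; the federated lookup is always current but depends on every member being reachable.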

19
www.ngic.re.kr
20
www.ncbi.nih.gov/LocusLink
21
New Data Types
  • Textual
      • Nucleotide or amino acid sequences
      • Associated feature annotation
      • Bibliographical texts
  • Numeric
      • Gene expression profiles
      • Results from statistical analysis
  • Graphical
      • Protein-protein interaction network
      • Genetic network
      • Biochemical reaction pathways

22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
Building a Nation from a Land of City States
  • Lincoln D. Stein
  • Cold Spring Harbor Laboratory

26
Italy in the Middle Ages
27
Bioinformatics, ca. 2002
Bioinformatics In the XXI Century
28
Making Easy Things Hard
Give me all human sequences submitted to
GenBank/EMBL last week.
29
Lots of ways to do it
  • Download the weekly update of GenBank/EMBL from
    the FTP site
  • Use official network-based interfaces to the data
      • NCBI toolkit
      • EBI CORBA and XEMBL servers
  • Use friendly web interfaces at NCBI, EBI
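
As one illustration of the network-based route, the request "all human sequences from the last week" can be written as an NCBI E-utilities esearch query. The sketch only builds the URL and sends nothing. Treating "submitted" as the publication date field (pdat) is an assumption, and the function name is illustrative.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def recent_human_sequences_url(days: int = 7, max_ids: int = 100) -> str:
    """Build an esearch URL for human nucleotide records from the last `days` days."""
    params = {
        "db": "nuccore",                   # Entrez nucleotide database
        "term": "Homo sapiens[Organism]",  # restrict to human records
        "reldate": days,                   # look back this many days
        "datetype": "pdat",                # assumption: publication date
        "retmax": max_ids,                 # cap on returned identifiers
    }
    return f"{ESEARCH}?{urlencode(params)}"


url = recent_human_sequences_url()
print(url)
```

The response lists record identifiers only; fetching the records themselves takes a second request.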

30
Perl/Java/Python to the Rescue
  • One script to do the web fetch
  • Another to parse the file format
  • A third to load the data into a private database
  • A fourth to repeat this weekly
  • Result
      • 6,719 scripts that do the same thing
      • None of them work together
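
The four scripts listed above can be sketched as three small functions that share one agreed record shape, plus a driver that a scheduler would call weekly. The tab-separated input format, the table layout, and the DEMO accessions are all hypothetical; the fetch step is injected so that it can be swapped without touching the parser or the loader.

```python
import sqlite3
from typing import Callable, Iterable, Iterator, Tuple


def parse_records(text: str) -> Iterator[Tuple[str, str]]:
    """Parse a toy 'accession<TAB>sequence' format, one record per line."""
    for line in text.splitlines():
        if line.strip():
            accession, sequence = line.split("\t", 1)
            yield accession.strip(), sequence.strip()


def load_records(conn: sqlite3.Connection,
                 records: Iterable[Tuple[str, str]]) -> int:
    """Store records in a private database, replacing older copies."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS seqs "
        "(accession TEXT PRIMARY KEY, sequence TEXT)"
    )
    rows = list(records)
    conn.executemany("INSERT OR REPLACE INTO seqs VALUES (?, ?)", rows)
    conn.commit()
    return len(rows)


def weekly_update(fetch: Callable[[], str], conn: sqlite3.Connection) -> int:
    """Fetch, parse, and load in one pass; meant to be run by a scheduler."""
    return load_records(conn, parse_records(fetch()))


def fake_fetch() -> str:
    """Stand-in for the web fetch; returns two made-up records."""
    return "DEMO0001\tACGT\nDEMO0002\tGGCC\n"


conn = sqlite3.connect(":memory:")
loaded = weekly_update(fake_fetch, conn)
print(loaded)
```

Because the steps meet only at the (accession, sequence) tuple, a different fetcher or parser can be dropped in, which is exactly what the 6,719 incompatible scripts lack.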

31
What's Wrong with This?
  • My EMBL fetcher is poorly documented, so you write
    your own
  • Your fetcher won't work with my parser
  • My parser won't work with your fetcher
  • We've now wasted 20 hours rather than 10
  • Multiply this by 6,719

32
What Else Is Wrong?
  • NCBI/EBI tweaks something
  • 6,719 scripts fail at once
  • 6,719 bioinformaticists tear their hair
  • 21,261 biologists curse the bioinformaticists
  • 6,719 bioinformaticists curse their own existence

33
Unifying Bioinformatics Services
  • MIMBD: Meetings on the Interconnection of
    Molecular Biology Databases
  • Federated models: Gaea, Kleisli
  • Data warehouses: GUS, MODs, Ensembl, UCSC
  • Ad hoc web services
  • Formal web services

34
Ad hoc services
BioXXX
Conf file
Your Script
35
Formal Web Services
GO Service
BLAST Service
SeqFetch Service
BLAT Service
SeqFetch Service
Microarray Service
36
Formal Web Services
GO Service
BLAST Service
SeqFetch Service
BLAT Service
SeqFetch Service
Service Registry
Microarray Service
37
Formal Web Services
GO Service
BLAST Service
SeqFetch Service
BLAT Service
SeqFetch Service
BioXXX
Service Registry
Microarray Service
Microarray Service
Your Script
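The diagram above places a service registry between the script and the individual services. A toy in-memory version of that idea follows; the class, its methods, and the two providers are hypothetical stand-ins, not UDDI or any real registry API.

```python
from typing import Callable, Dict, List

Provider = Callable[[str], str]


class ServiceRegistry:
    """Maps a service kind (e.g. 'SeqFetch') to the providers offering it."""

    def __init__(self) -> None:
        self._services: Dict[str, List[Provider]] = {}

    def register(self, kind: str, provider: Provider) -> None:
        self._services.setdefault(kind, []).append(provider)

    def lookup(self, kind: str) -> Provider:
        providers = self._services.get(kind)
        if not providers:
            raise LookupError(f"no provider registered for {kind!r}")
        # Any provider of the right kind will do; take the first one.
        return providers[0]


registry = ServiceRegistry()
# Two independent sites offer the same kind of service, as in the diagram.
registry.register("SeqFetch", lambda acc: f"siteA:{acc}")
registry.register("SeqFetch", lambda acc: f"siteB:{acc}")

# "Your Script" asks for a kind of service, not for a particular server.
fetch = registry.lookup("SeqFetch")
result = fetch("M10154")
print(result)
```

The script names only the kind of service it needs, so a provider can change or disappear without the script being rewritten.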
38
Technical Infrastructure is Here
  • Common vocabulary: GO
  • Transport format: XML
  • Data definition language: XSD
  • Wire protocol: SOAP
  • Service definition language: WSDL
  • Service registry: UDDI

(almost)
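
To make the "wire protocol" layer concrete, here is a sketch of a SOAP 1.1 request envelope built with Python's standard library. The envelope namespace is the standard SOAP 1.1 one; the service namespace, the getSequence operation, and its accession parameter are hypothetical, since the slides do not define any service interface.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:seqfetch"  # hypothetical service namespace


def build_request(accession: str) -> str:
    """Return a SOAP envelope asking a made-up SeqFetch service for a record."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    # Operation and parameter names are illustrative only.
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}getSequence")
    ET.SubElement(call, f"{{{SERVICE_NS}}}accession").text = accession
    return ET.tostring(envelope, encoding="unicode")


request_xml = build_request("M10154")
print(request_xml)
```

In a full stack, a WSDL document would describe this operation and its types (via XSD), and a registry would tell the client where to send the envelope.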
39
Distributed Annotation System
http://www.biodas.org
AC003027
M10154
AC005122
Thursday 10:30 AM, Canyon IV
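A DAS client asks an annotation server for the features on a segment of a reference sequence, such as one of the accessions shown above. The sketch below builds such a request following the DAS 1.x URL convention as commonly described (server/das/source/features?segment=ref:start,stop). The server name, data source name, and coordinates are hypothetical.

```python
from urllib.parse import urlencode


def das_features_url(server: str, data_source: str, reference: str,
                     start: int, stop: int) -> str:
    """Build a DAS 1.x style 'features' request for one sequence segment."""
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    # Keep ':' and ',' readable instead of percent-encoding them.
    query = urlencode({"segment": f"{reference}:{start},{stop}"}, safe=":,")
    return f"{server.rstrip('/')}/das/{data_source}/features?{query}"


# Hypothetical server and data source; the accession comes from the slide.
das_url = das_features_url("http://das.example.org", "demo",
                           "AC003027", 1, 50000)
print(das_url)
```

Because every server answers the same kind of request, a client can overlay annotations from several independent sites on one reference sequence.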
40
Europe, ca 2000
41
Bioinformatics, ca 2010?
42
Collection and Sharing of National Genome
Information
KNIH
Human
Microbial
NGIC
Proteome
Plant
Animal
Crop
Ag-Bio
43
National Genome Information Network
KNIH
Human
Microbial
NGIC
Proteome
Plant
Animal
Crop
Ag-Bio