APAN e-Science Workshop: e-Bio System for Bio-Knowledge Discovery


About This Presentation

Title:

APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

Slides: 44

Provided by: ssk6

Transcript and Presenter's Notes

Title: APAN e-Science Workshop e-Bio System for Bio-Knowledge Discovery

1
APAN e-Science Workshop
e-Bio System for Bio-Knowledge Discovery

2003.8.27
Sangsoo Kim
National Genome Information Center (NGIC)
Korea Research Institute of Bioscience and Biotechnology

2
Bio-Databases and Servers

Contents
Bibliographic (Journal abstracts such as Medline)
Experimental data (Sequences or structures)
Results from annotation and analyses
Bioinformatic analysis tools
Purpose
Storing and managing raw data
Querying for knowledge discovery
Sharing information with others
Serving others with online analysis

3
New Role of Databases

New discoveries of biological knowledge are published in scientific journals
But journal space is limited and unsuitable for publishing large amounts of high-throughput data
Supplementary information is provided on an accompanying website
Readers can download the supplementary information and analyze it from a different angle
Combining it with other information may yield unexpected results
Journal publishers require supplementary information to be deposited in public archives

4
Example - Nucleotide Sequence Repositories

Nucleotide sequences discovered by sequencing experiments are deposited in one of the public archives, and the journal paper lists only the accession numbers (without deposition, you cannot publish a sequence discovery in a journal)
The public archives are
DDBJ, operated by CIB, NIG in Japan
EMBL, operated by EMBL-EBI in the UK
GenBank, operated by NCBI, NIH in the USA
The contents of these archives are exchanged daily and are freely accessible to everybody
Now extended to archive DNA chip data as well

5
Growth of GenBank - A Nucleotide Sequence Repository
6
Entrez Home Page
RTFM
7
Entrez Display
GenBank as HTML
FASTA as HTML
8
Example - BLAST Servers

Originally developed to compare one's own sequence against those in the repository, to check whether it is novel
Extended to detect distantly related sequences, serving as the major sequence annotation tool
Servers accept various kinds of queries and return alignment results over the WWW
The most widely used bioinformatic tool
For the analysis of many sequences, a local installation is preferable

9
BLAST (Basic Local Alignment Search Tool)
http://www.ncbi.nlm.nih.gov/BLAST
program   query      database
blastn    dna        dna
blastp    protein    protein
blastx    dna (6x)   protein
tblastn   protein    dna (6x)
tblastx   dna (6x)   dna (6x)
RTFM
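
In the program list on this slide, "dna (6x)" means the DNA is translated in all six reading frames (three per strand) before being compared at the protein level. The following is a minimal sketch of that translation step only, assuming the standard genetic code. The names `BLAST_PROGRAMS`, `translate`, and `six_frame_translation` are illustrative and are not part of BLAST itself, which also does the searching and scoring omitted here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Program -> (query type, database type), as listed on the slide.
BLAST_PROGRAMS = {
    "blastn": ("dna", "dna"),
    "blastp": ("protein", "protein"),
    "blastx": ("dna (6x)", "protein"),
    "tblastn": ("protein", "dna (6x)"),
    "tblastx": ("dna (6x)", "dna (6x)"),
}


def reverse_complement(dna):
    """Return the reverse complement of an upper-case A/C/G/T string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate one reading frame; an incomplete trailing codon is dropped.

    Codons containing anything other than A/C/G/T become 'X'.
    """
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(dna):
    """Return the six translations keyed '+1'..'+3' and '-1'..'-3'."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCTAA")
print(frames["+1"])  # MA*  ('*' marks a stop codon)
```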
10
BLASTN (Cont'd)
Descriptions
Alignments
11
Example - Derived Databases

Swiss-Prot, PIR
Proteins are predicted from deposited nucleotide sequences, either mRNA or genomic DNA
Functions and features of the proteins are annotated manually by experts
Protein motifs
Prosite, Pfam, BLOCKS, InterPro
Keyword querying and motif detection in the user's sequence
Gene Ontology
Hierarchical organization of biological terms
Cataloging associated gene products

12
ExPASy (http://www.expasy.ch)
Expert Protein Analysis System
13
NiceProt View
14
Gene Ontology

Systematic classification of biological
terminology
Molecular function
Biological process
Cellular component
Controlled vocabulary
Associated GENE list
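
The hierarchy and the associated gene list can be pictured with a small data structure. This is a minimal sketch under stated assumptions: the vocabulary below is a toy, heavily simplified subset (the real ontology has many intermediate terms), and the gene names `geneA` and `geneB` are hypothetical. It shows why a hierarchy helps: a query for a general term also returns genes annotated with more specific terms beneath it.

```python
# Toy controlled vocabulary: term -> list of parent terms.
# A term may have more than one parent, so this is a graph, not a strict tree.
PARENTS = {
    "molecular function": [],
    "catalytic activity": ["molecular function"],
    "binding": ["molecular function"],
    "kinase activity": ["catalytic activity"],
    "ATP binding": ["binding"],
}

# Hypothetical gene products and the terms they are annotated with.
ANNOTATIONS = {
    "geneA": {"kinase activity", "ATP binding"},
    "geneB": {"binding"},
}


def ancestors(term):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def genes_for(term):
    """Genes annotated with the term itself or with any of its descendants."""
    return sorted(
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(t == term or term in ancestors(t) for t in terms)
    )


print(genes_for("binding"))             # ['geneA', 'geneB']
print(genes_for("catalytic activity"))  # ['geneA']
```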

15
(No Transcript)
16
Data Mining

Objective
Discovery of (biological) knowledge by querying
information in the databases and comprehending it
Problems
Too many databases
Different protocols for access
Lack of standards
Poor quality or propagation of errors
Solutions
Data warehousing or federated databases

17
Catalog of Bio-DBs arranged by Data Domain
18
Database of Databases

Data warehousing
Collect all databases by mirroring
Store in a unified format
Entrez (NCBI) or SRS (EBI)
Powerful but heavy maintenance load
Federated databases
Maintained by participating members
Accessed by common protocols
Bio-DAS or Web Services via SOAP/XML
Next generation technology, but dependent on both
the cooperation by members and Internet bandwidth
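
The contrast between the two approaches can be sketched in a few lines. This is an illustration only: `MemberDatabase` is a hypothetical in-memory stand-in for a remote site, the accessions and sequences are made up, and no real protocol (Bio-DAS, SOAP) is involved.

```python
class MemberDatabase:
    """Stand-in for one member site; a real one would sit behind a network protocol."""

    def __init__(self, name, records):
        self.name = name
        self._records = dict(records)

    def fetch(self, accession):
        """Answer a single query at request time (federated access)."""
        return self._records.get(accession)

    def dump(self):
        """Export everything, as a mirror download would (warehouse loading)."""
        return dict(self._records)


def build_warehouse(members):
    """Warehouse: mirror every member into one local store in a unified format."""
    store = {}
    for member in members:
        for accession, sequence in member.dump().items():
            store[accession] = {"source": member.name, "sequence": sequence}
    return store


def federated_fetch(members, accession):
    """Federation: forward the query to each member; nothing is stored locally."""
    for member in members:
        sequence = member.fetch(accession)
        if sequence is not None:
            return {"source": member.name, "sequence": sequence}
    return None


# Hypothetical accessions and sequences, for illustration only.
members = [
    MemberDatabase("site_a", {"ACC0001": "ACGT"}),
    MemberDatabase("site_b", {"ACC0002": "GGCC"}),
]
warehouse = build_warehouse(members)
print(warehouse["ACC0002"]["source"])                  # site_b
print(federated_fetch(members, "ACC0001")["source"])   # site_a
```

The warehouse answers from its local copy, which is stale until the next mirroring run. The federated lookup is always current but only works while every member is reachable.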

19
www.ngic.re.kr
20
www.ncbi.nih.gov/LocusLink
21
New Data Types

Textual
Nucleotide or amino acid sequences
Associated feature annotation
Bibliographical texts
Numeric
Gene expression profiles
Results from statistical analysis
Graphical
Protein-protein interaction network
Genetic network
Biochemical reaction pathways

22
(No Transcript)
23
(No Transcript)
24
(No Transcript)
25
Building a Nation from a Land of City States

Lincoln D. Stein
Cold Spring Harbor Laboratory

26
Italy in the Middle Ages
27
Bioinformatics, ca. 2002
Bioinformatics In the XXI Century
28
Making Easy Things Hard
Give me all human sequences submitted to
GenBank/EMBL last week.
29
Lots of ways to do it

Download the weekly update of GenBank/EMBL from the FTP site
Use official network-based interfaces to data
NCBI toolkit
EBI CORBA XEMBL servers
Use friendly web interfaces at NCBI, EBI

30
Perl/Java/Python to the Rescue

One script to do the web fetch
Another to parse the file format
A third to move into private database
A fourth to repeat this weekly
Result
6,719 scripts that do the same thing
None of them work together
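
For concreteness, the parse and load steps of such a one-off script might look like the sketch below. It assumes FASTA input and an SQLite table as the private database. The sample records, table layout, and function names are hypothetical, and the web fetch and weekly scheduling are left out.

```python
import sqlite3


def _record(header, chunks):
    """Split a header into identifier and description; join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)


def load_records(conn, records):
    """Insert parsed records into a private table, replacing older copies."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sequences "
        "(id TEXT PRIMARY KEY, description TEXT, sequence TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO sequences VALUES (?, ?, ?)", records)
    conn.commit()


# Made-up records standing in for a downloaded weekly update.
SAMPLE = """>seq1 hypothetical example record
ACGTACGT
ACGT
>seq2
GGGCCC
"""

conn = sqlite3.connect(":memory:")
load_records(conn, parse_fasta(SAMPLE))
print(conn.execute("SELECT COUNT(*) FROM sequences").fetchone()[0])  # 2
```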

31
What's Wrong with This?

My EMBL fetcher is poorly documented, so you write your own
Your fetcher won't work with my parser
My parser won't work with your fetcher
We've now wasted 20 hours rather than 10
Multiply this by 6,719

32
What Else Is Wrong?

NCBI/EBI tweaks something
6,719 scripts fail at once
6,719 bioinformaticists tear their hair
21,261 biologists curse the bioinformaticists
6,719 bioinformaticists curse their own existence

33
Unifying Bioinformatics Services

MIMBD: Meetings on the Interconnection of Molecular Biology Databases
Federated models: Gaea, Kleisli
Data warehouses: GUS, MODs, Ensembl, UCSC
Ad hoc web services
Formal web services

34
Ad hoc services
BioXXX
Conf file
Your Script
35
Formal Web Services
GO Service
BLAST Service
SeqFetch Service
BLAT Service
SeqFetch Service
Microarray Service
36
Formal Web Services
GO Service
BLAST Service
SeqFetch Service
BLAT Service
SeqFetch Service
Service Registry
Microarray Service
37
Formal Web Services
GO Service
BLAST Service
SeqFetch Service
BLAT Service
SeqFetch Service
BioXXX
Service Registry
Microarray Service
Microarray Service
Your Script
38
Technical Infrastructure is Here (almost)

Common vocabulary: GO
Transport format: XML
Data definition language: XSD
Wire protocol: SOAP
Service definition language: WSDL
Service registry: UDDI
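
To make the wire-protocol layer concrete, here is a minimal sketch of building and reading a SOAP 1.1 request with Python's standard library. The envelope namespace is the standard SOAP 1.1 one. The service namespace `urn:example:seqfetch` and the `fetchSequence` operation are hypothetical and do not describe any real SeqFetch service. In practice a WSDL file would define the operation and a toolkit would generate this XML.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace
SERVICE_NS = "urn:example:seqfetch"                     # hypothetical service namespace


def build_request(accession):
    """Return a SOAP envelope asking a (hypothetical) service for one sequence."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}fetchSequence")
    ET.SubElement(call, f"{{{SERVICE_NS}}}accession").text = accession
    return ET.tostring(envelope, encoding="unicode")


def read_accession(xml_text):
    """Server-side view: pull the accession back out of the envelope."""
    root = ET.fromstring(xml_text)
    node = root.find(
        f"{{{SOAP_NS}}}Body/{{{SERVICE_NS}}}fetchSequence/{{{SERVICE_NS}}}accession"
    )
    return node.text if node is not None else None


request = build_request("AC003027")
print(read_accession(request))  # AC003027
```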
39
Distributed Annotation System
http://www.biodas.org
AC003027
M10154
AC005122
Thursday 10:30 AM, Canyon IV
40
Europe, ca. 2000
41
Bioinformatics, ca. 2010?
42
Collection and Sharing of National Genome
Information
KNIH
Human
Microbial
NGIC
Proteome
Plant
Animal
Crop
Ag-Bio
43
National Genome Information Network
KNIH
Human
Microbial
NGIC
Proteome
Plant
Animal
Crop
Ag-Bio
