Genome, Protein and Model Organism Databases - PowerPoint PPT Presentation

1
Genome, Protein and Model Organism Databases
Anne Estreicher, Swiss-Prot Group, Swiss Institute of Bioinformatics, Geneva, Switzerland
Anne.Estreicher_at_isb-sib.ch
Bioinformatic and Comparative Genome Analysis Course, HKU-Pasteur Research Centre, Hong Kong, China, August 17 - August 29, 2009
2
  • Outline
  • Introduction (definitions, history)
  • From DNA sequence to genomic tools
  • The flow of information from DNA to proteins
  • Protein sequence databases
  • MODs at a glance

3
What is a database?
  • A collection of related data, which are
  • structured
  • searchable
  • updated periodically
  • cross-referenced
  • Also includes the associated tools necessary for access/query, download, etc.
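
The properties listed above can be illustrated with a toy example. This is a minimal sketch, assuming an in-memory SQLite database; the table names, column names and the records are hypothetical and do not correspond to any real resource.

```python
import sqlite3

def build_toy_database():
    """Create an in-memory database with two related, structured tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE entry (
            accession   TEXT PRIMARY KEY,   -- stable identifier
            description TEXT NOT NULL,
            updated     TEXT NOT NULL       -- date of last update
        );
        CREATE TABLE xref (
            accession   TEXT REFERENCES entry(accession),
            resource    TEXT NOT NULL,      -- name of the linked database
            external_id TEXT NOT NULL
        );
    """)
    # Hypothetical records, for illustration only.
    conn.executemany(
        "INSERT INTO entry VALUES (?, ?, ?)",
        [("TOY0001", "toy kinase", "2009-07-07"),
         ("TOY0002", "toy transporter", "2009-06-16")],
    )
    conn.executemany(
        "INSERT INTO xref VALUES (?, ?, ?)",
        [("TOY0001", "ToyStructureDB", "S-42"),
         ("TOY0001", "ToyPathwayDB", "P-7")],
    )
    return conn

def search(conn, keyword):
    """Searchable: accessions whose description contains the keyword."""
    rows = conn.execute(
        "SELECT accession FROM entry WHERE description LIKE ? ORDER BY accession",
        (f"%{keyword}%",),
    )
    return [row[0] for row in rows]

def cross_references(conn, accession):
    """Cross-referenced: (resource, external_id) links for one entry."""
    rows = conn.execute(
        "SELECT resource, external_id FROM xref WHERE accession = ? ORDER BY resource",
        (accession,),
    )
    return rows.fetchall()

conn = build_toy_database()
print(search(conn, "kinase"))
print(cross_references(conn, "TOY0001"))
```

The query functions stand in for the "associated tools": the data alone are of little use without a way to search them and follow their links.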

4
  • Why do we need databases ?
  • Data need to be stored, curated and made
    available for analysis and knowledge discovery
  • Efficient way of sharing data, independently of
    regular publications
  • Essential resources for both experimental and
    computational biologists

5
Databases in biology: not a new issue
  • 1954: First protein sequence (insulin, by F. Sanger)
  • 1965: Atlas of Protein Sequence and Structure (65 proteins)

6
The first protein sequence "database" by
Margaret Dayhoff (1965) contained 65 proteins
7
Databases: not a new issue
  • 1954: First protein sequence (insulin, by F. Sanger)
  • 1965: Atlas of Protein Sequence and Structure (65 proteins)
  • Mid-70s: Improvements in DNA sequencing
  • 1979: Los Alamos Sequence Library (Walter Goad)
  • 1980: 80 genes fully sequenced
  • -> Need to store the data and to make them available for analysis (in a format acceptable for human eyes and machines)
  • -> ARCHIVE
  • -> RACE for the central position in life sciences

And the winner is
8
Databases: not a new issue
EMBL-Bank - Europe: 1980
GenBank - USA: 1982
DDBJ - Asia: 1986
leading to the establishment of the INSDC (International Nucleotide Sequence Database Collaboration) -> daily exchanges of data
9
www.insdc.org
10
  • EMBL-Bank - GenBank - DDBJ
  • Main resources for DNA and RNA sequences
  • Sequences used to be retrieved from publications -> now direct submissions from individual researchers, genome sequencing projects and patent applications
  • Journal publishers generally require sequence deposition prior to publication so that an accession number can be included in the paper.
  • 1. True for nucleic acid, not for protein sequences
  • 2. Not always put into practice
  • => Sequences that are not submitted are LOST!!!
  • Archives (primary databases)
  • data belong to submitters

11
EMBL-Bank - GenBank - DDBJ: Archive (primary databases) => data belong to the submitter
  • Minimal checks, such as vector contamination
  • Annotation by the submitters

12
Databases: not a new issue
  • 1954: First protein sequence (insulin, by F. Sanger)
  • 1965: Atlas of Protein Sequence and Structure (65 proteins)
  • 1979: Los Alamos Sequence Library (Walter Goad) - DNA
  • 1982: EMBL-Bank - DNA
  • 1984: GenBank - DNA
  • 1986: DDBJ - DNA

13
Databases: not a new issue
  • 1954: First protein sequence (insulin, by F. Sanger)
  • 1965: Atlas of Protein Sequence and Structure (65 proteins)
  • 1979: Los Alamos Sequence Library (Walter Goad) - DNA
  • 1982: EMBL-Bank - DNA
  • 1984: GenBank - DNA
  • 1986: DDBJ - DNA
  • -> ARCHIVES (primary databases) may not be sufficient
  • -> need to annotate the data to produce KNOWLEDGE
  • 1986: Swiss-Prot - protein sequences: a paradigm for annotated (secondary) databases

14
The Swiss-Prot concept
  • non-redundant
  • Protein products of
  • 1 gene / 1 species -> 1 entry,
  • Manually annotated (=> curator judgement on data!),
  • Highly cross-referenced (1st life-science database to provide cross-references) (links to > 130 databases from www.uniprot.org).

15
Databases: not a new issue
  • 1954: First protein sequence (insulin, by F. Sanger)
  • 1965: Atlas of Protein Sequence and Structure (65 proteins)
  • 1979: Los Alamos Sequence Library (Walter Goad) - DNA
  • 1982: EMBL-Bank - DNA
  • 1984: GenBank - DNA
  • Protein Information Resource (PIR) - protein sequences
  • 1986: DDBJ - DNA
  • Swiss-Prot - protein sequences
  • 1996: TrEMBL (Translated EMBL) - protein sequences
  • Complement of Swiss-Prot to cope with the increasing amount of new sequences: AUTOMATIC ANNOTATION!

16
UniProtKB/Swiss-Prot growth
[Figure: number of entries vs. release number]
  • 1986: 3939 entries
  • 1996 (creation of TrEMBL): Swiss-Prot 52205 entries, TrEMBL 61137 entries
  • Swiss-Prot rel. 57.5 (07-Jul-2009): 470369 entries
17
UniProtKB growth
[Figure: number of entries vs. release number, 1986-2009; TrEMBL: automated curation, Swiss-Prot: manual curation]
TrEMBL rel. 40.5 (07-Jul-2009): 8594382 entries
Swiss-Prot rel. 57.5 (07-Jul-2009): 470369 entries
  • TrEMBL growth (sequences/day)
  • 2004 ? 1500
  • 2006-2007 ? 3500
  • ? >5000
  • ? 8000
18
  • New challenge
  • Flood of data -> need to be stored, curated and made available for analysis and knowledge discovery

19
(R)evolution of these last 20 years
  • Life sciences used to be rich in hypotheses,
    well-off in knowledge and poor in data
  • Today they are very rich in data, not so well-off
    in knowledge and very poor in hypotheses.

20
Science (1993) 262, 502
21
EMBL Database Growth: http://www.ebi.ac.uk/embl/Services/DBStats/
22
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
http://www.ncbi.nlm.nih.gov/genomes/GenomesHome.cgi?taxid=10239&hopt=stat
In 4 months, 374 new genomes, and 77 were completed: 100 genomes/month (in 2008 -> 50 genomes/month)
+ 2360 viral (+ viroid) genomes => Total: 5600 genomes
23
http://genomesonline.org/index2.htm
24
http://www.genomesonline.org/gold.cgi
26
http://www.genomesonline.org/gold.cgi
27
Metagenomics: study of genetic material recovered directly from environmental samples
  • Global Ocean Sampling (C. Venter)
  • Whale fall
  • Soil, sand beach, New York air,
  • Human fluids, mouse gut

Venter's Sorcerer II
28
  • Flood in the world of proteins
  • 1965: first protein sequence "database" by Margaret Dayhoff (65 proteins)
  • July 2009: 20 million unique protein sequences (source: UniParc - http://www.uniprot.org/uniparc/)
  • UniParc
  • non-redundant database that contains most of the publicly available protein sequences in the world (includes sequences from EMBL-Bank/DDBJ/GenBank nucleotide sequence databases, Ensembl, FlyBase, H-Invitational Database (H-Inv), International Protein Index (IPI), Patent Offices (EPO, JPO and USPTO), PIR-PSD, Protein Data Bank (PDB), Protein Research Foundation (PRF), RefSeq, Saccharomyces Genome database (SGD), TAIR Arabidopsis thaliana Information Resource, TROME, UniProtKB/Swiss-Prot and TrEMBL, Vertebrate Genome Annotation database (VEGA) and WormBase).

29
  • New challenge
  • Flood of data
  • Flood of databases

30
NAR: the 1st issue of the year is always dedicated to databases. A "clean" list of databases is provided (! not exhaustive !)
31
The NAR Online Molecular Biology Database collection in 2009: a total of 1170 databases (19 obsolete removed)
http://www.oxfordjournals.org/nar/database/a/
32
NAR "clean" list of databases: http://www.oxfordjournals.org/nar/database/a/
33
Most recent NAR paper about the database (not
available for all db, some described in other
journals)
34
A "clean" list of databases can be found in the NAR online molecular biology database collection: http://www.oxfordjournals.org/nar/database/a/
36
BIOLOGICAL DATABASE CATEGORIES
  • Databases of nucleic acid sequences (RNA, DNA)
  • Databases of protein sequences
  • Databases of protein motifs and protein domains
  • Databases of structures
  • Databases of genomes
  • Databases of genes
  • Databases of expression profiles
  • Databases of SNPs and mutations
  • Databases of metabolic pathways
  • Databases of protein interactions
  • Databases of taxonomy

Databases containing sequences or data directly
derived from sequences.
37
DNA sequences: What? Where? How?
Genomic tools: NCBI, UCSC
38
Accession number Molecule type Date of
submission Definition
GenBank entry AF415175: http://www.ncbi.nlm.nih.gov/nuccore/16589063
Nucleotide sequence
39
Accession number Molecule type Date of
submission Definition
Taxonomy
Nucleotide sequence
40
Accession number Molecule type Date of
submission Definition
Taxonomy
References
Nucleotide sequence
41
Accession number Molecule type Date of
submission Definition
Taxonomy
References
Features: information provided by the submitter; may include annotation of the sequence
  • Organism
  • Molecule type
  • Chromosomal location
  • Tissue type
  • Gene name
  • CDS annotation => protein sequence
  • Protein IDentifier (PID: stable identifier and version number)
Nucleotide sequence
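
The layout of such an entry (keyword fields, a feature table, then the sequence) can be read with very little code. The following is a minimal sketch, assuming the usual GenBank flat-file conventions (keywords in the first 12 columns, feature keys starting in column 6, sequence after ORIGIN, record closed by "//"). The record `TOY00001` and the gene `TOY1` are hypothetical, and the parser is deliberately simplified; an established library would be the better choice for real work.

```python
TOY_RECORD = """\
LOCUS       TOY00001                  24 bp    mRNA    linear   PRI 01-JAN-2009
DEFINITION  Hypothetical toy mRNA, for illustration only.
ACCESSION   TOY00001
VERSION     TOY00001.1
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
FEATURES             Location/Qualifiers
     source          1..24
                     /organism="Homo sapiens"
     CDS             1..24
                     /gene="TOY1"
ORIGIN
        1 atggccaaag gtctgtggtt ctaa
//
"""

def parse_flat_file(text):
    """Parse one flat-file record into a dict (simplified, illustrative)."""
    record = {"features": [], "sequence": ""}
    current = None          # name of the field currently being filled
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if in_sequence:
            # Sequence lines: a position number followed by blocks of bases.
            record["sequence"] += "".join(line.split()[1:])
            continue
        if line.startswith("ORIGIN"):
            in_sequence = True
            continue
        if current == "FEATURES" and line.startswith(" "):
            if line[5:6].strip():          # a feature key starts in column 6
                record["features"].append({
                    "key": line[5:21].strip(),
                    "location": line[21:].strip(),
                    "qualifiers": [],
                })
            elif record["features"]:       # qualifier of the last feature
                record["features"][-1]["qualifiers"].append(line.strip())
            continue
        keyword, value = line[:12].strip(), line[12:].strip()
        if keyword:                        # keyword or indented sub-keyword
            current = keyword
            record[current] = value
        elif current:                      # continuation of the previous field
            record[current] += " " + value
    return record

entry = parse_flat_file(TOY_RECORD)
print(entry["ACCESSION"], entry["ORGANISM"], len(entry["sequence"]))
print([feature["key"] for feature in entry["features"]])
```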
42
Protein sequence
43
"Features"  may provide much more
information depending upon the sequence and the
submitter
3' end of chromosome Y: EMBL AJ271736
44
Very similar view, links and options from the 3 sites: EMBL-Bank - GenBank - DDBJ
http://www.ddbj.nig.ac.jp/
http://www.ebi.ac.uk/embl/
http://www.ncbi.nlm.nih.gov/
45
How to find a DNA sequence at the NCBI
46
http://www.ncbi.nlm.nih.gov/
47
Databases @ NCBI: http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
The Entrez system: integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others => Maximal interconnectivity
48
Databases @ NCBI: http://www.ncbi.nlm.nih.gov/Database/datamodel/index.html
49
Simple search with an EMBL-Bank/GenBank/DDBJ accession number
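
The same lookup can also be scripted. The slides use the NCBI web pages; the sketch below instead assumes the NCBI E-utilities `efetch` service, which is not mentioned in the lecture, and only builds the request URL without sending it. The accession AF415175 is the GenBank entry shown earlier; the function name and its defaults are illustrative choices.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (an assumption of this sketch;
# the lecture itself only shows the web interface).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def efetch_url(accession, db="nuccore", rettype="gb", retmode="text"):
    """Build an efetch URL that would return one record as flat-file text."""
    query = urlencode({
        "db": db,            # Entrez database to query
        "id": accession,     # accession number or other identifier
        "rettype": rettype,  # "gb" asks for the GenBank flat-file view
        "retmode": retmode,
    })
    return f"{EUTILS_BASE}/efetch.fcgi?{query}"

url = efetch_url("AF415175")
print(url)
```

Building the URL separately from fetching it keeps the example runnable offline and makes it easy to check what would be requested before any traffic is sent.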
52
Searching from a bibliographic reference
54
Search results 2 and 3 -> accession numbers provided by the authors in the article -> GenBank records
Search result 1 -> corresponds to the RefSeq database
55
  • RefSeq (Reference Sequence)
  • Provides a comprehensive, integrated, non-redundant, well-annotated set of sequences, including genomic DNA, transcripts, and proteins
  • Most data extracted from GenBank -> choice of a reference sequence and annotation (no documented comparison between sequences)
  • Some entries based on predictions (accession prefixes XM_, XR_, XP_, ZP_)
  • Currently, 8'665 species represented
  • Annotation
  • Manual annotation (only in entries tagged as "reviewed")
  • Collaboration
  • Propagation from other sources
  • Computation.
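
Because the prediction status is encoded in the accession prefix, it can be checked before an entry is even opened. A minimal sketch: the prefixes XM_, XR_, XP_ and ZP_ come from the slide, the others (NM_, NR_, NP_, NC_, NG_) are standard RefSeq prefixes added here for contrast, and the function name is hypothetical. The table is not exhaustive.

```python
# prefix -> (molecule type, based on prediction?)
REFSEQ_PREFIXES = {
    "NC_": ("genomic", False),
    "NG_": ("genomic", False),
    "NM_": ("mRNA", False),
    "NR_": ("non-coding RNA", False),
    "NP_": ("protein", False),
    "XM_": ("mRNA", True),
    "XR_": ("non-coding RNA", True),
    "XP_": ("protein", True),
    "ZP_": ("protein", True),
}

def classify_refseq(accession):
    """Return molecule type and prediction flag for a RefSeq accession."""
    prefix = accession[:3].upper()
    if prefix not in REFSEQ_PREFIXES:
        raise ValueError(f"unrecognised RefSeq prefix in {accession!r}")
    molecule, predicted = REFSEQ_PREFIXES[prefix]
    return {"accession": accession, "molecule": molecule, "predicted": predicted}

# NM_015595 is the SGEF mRNA entry used in this lecture;
# XP_000001 is a made-up accession used only to show the other branch.
print(classify_refseq("NM_015595"))
print(classify_refseq("XP_000001"))
```

A prefix check only says how the record was produced; whether an entry was manually reviewed still has to be read from its status.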

56
RefSeq (Reference Sequence)
Status               Curation
GENOME ANNOTATION    No
INFERRED             No
MODEL                No
PREDICTED            No
PROVISIONAL          No
REVIEWED             Yes (sequence, functional information and features)
VALIDATED            Yes (initial sequence)
WGS                  No
57
RefSeq entry NM_015595 SGEF mRNA
Accession number Definition Taxonomy List of
references
58
RefSeq entry NM_015595 SGEF mRNA
Gene name Exon annotation CDS annotation and
sequence
59
RefSeq entry NM_015595 SGEF mRNA
Sequence
60
Searching with the gene name
62
RefSeq
63
  • NCBI Entrez system
  • Looks for the request in all NCBI databases
  • Cannot be ignored -> no simple way to search only in your favourite NCBI database

64
Searching using BLAST
73
UniSTS62643 maps to multiple loci in Homo
sapiens
74
UniGene
Mapping of known genes
75
Mapping of RNA (EMBL/GenBank/DDBJ RefSeq)
UniGene
Mapping of known genes
76
Mapping of RNA (EMBL/GenBank/DDBJ RefSeq)
UniGene
Mapping of RefSeq RNA
Mapping of known genes
77
Mapping of RNA (EMBL/GenBank/DDBJ RefSeq)
UniGene
Mapping of RefSeq RNA
Mapping of known genes
This default view can be customized
78
1. Choose desired option
2. Add it (and/or remove undesired)
3. Apply the new display
81
Map viewer: 110 organisms represented in Genome database.
(www.ncbi.nlm.nih.gov/sites/entrez?db=genome)
82
Genomic tools on the UCSC server: BLAT search
84
Genome browser @ UCSC
Feb. 2009 assembly: not all data implemented! It may be better to use the former assembly for the time being.
http://genome.ucsc.edu/cgi-bin/hgBlat
87
Chromosomal location
gDNA sequence
Consensus CDS other sequences from reliable
resources
88
Annotation of genes is provided by multiple public resources, using different methods, and resulting in information that is similar but not always identical.
CCDS database goal: provide a standard set of gene annotations.
Collaborative project involving teams (manual and automated annotation):
  • European Bioinformatics Institute (EBI)
  • National Center for Biotechnology Information (NCBI)
  • Wellcome Trust Sanger Institute (WTSI)
  • University of California, Santa Cruz (UCSC)
Currently available only for human and mouse genomes (July 2009):
  • 20'159 human CCDS (including isoforms) -> 17'054 CCDS genes
  • 17'707 mouse CCDS (including isoforms) -> 16'889 CCDS genes
http://www.ncbi.nlm.nih.gov/projects/CCDS/CcdsBrowse.cgi
89
Chromosomal location
gDNA sequence
Consensus CDS other sequences from reliable
resources
All sequences can be retrieved
(Human) mRNAs
(Human) spliced ESTs
(Human) ESTs (including unspliced)
90
The view can be completely customized
91
including with various tools allowing
comparative genomics
92
http://genome.ucsc.edu/
and including your own data !
93
Back to the Blat viewer
94
Arrows >>>> show the direction of transcription
95
2 transcripts from the same locus:
BDNF (Brain-Derived Neurotrophic Factor)
BDNFOS (BDNF Opposite Strand)
97
View of alternative exons
Alternative exons
98
Interested in this exon?
Just zoom in
100
Genome browser @ UCSC has many great options, give it a try! http://genome.ucsc.edu/
101
Typical problems or Why wonderful tools will
never replace the brain of a life scientist !
103
Once upon a time, there was a gene on
chromosome 11
104
2 essential genome resources are missing from this lecture:
  • Ensembl (http://www.ensembl.org/index.html): automated annotation of many genomes
  • Vega (http://vega.sanger.ac.uk/index.html): high-quality manual annotation of genomes (currently Homo sapiens, Mus musculus, Danio rerio, Gorilla gorilla, Macropus eugenii, Sus scrofa, Canis familiaris).
Please go and visit them!
105
The flow of information: from DNA sequences to protein sequences
A little biology and a few databases
106
From genome to proteome: the example of human
Genome: 20500 human protein-encoding genes
Alternative promoter usage, alternative splicing, trans-splicing, mRNA editing
Increase in complexity: 5-10 x
Transcriptome
Post-translational modifications (PTMs): most PTMs cannot be predicted from DNA sequences
Proteome: 1'000'000 human proteins
107
The hectic life of a protein sequence
Data not submitted to public databases, delayed
or cancelled
cDNAs, ESTs, genomes,
Nucleic acid databases
DDBJ
EMBL
GenBank
International Nucleotide Sequence Database
Collaboration
www.insdc.org
108
!!!! 99% of the protein sequences found in databases come from the translation of nucleotide sequences => Experimental evidence may be lacking!
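
What "translation of nucleotide sequences" means in practice can be shown in a few lines. A minimal sketch, assuming the standard genetic code, a coding sequence that is already in frame, and translation that stops at the first stop codon; the function name and the example sequence are illustrative.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Translate an in-frame coding sequence up to the first stop codon."""
    cds = cds.upper().replace("U", "T")   # accept RNA as well as DNA
    protein = []
    for start in range(0, len(cds) - 2, 3):
        amino_acid = CODON_TABLE.get(cds[start:start + 3], "X")  # X = unknown
        if amino_acid == "*":             # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("atggccaaaggtctgtggttctaa"))  # MAKGLWF
```

The output is a purely conceptual translation: nothing in it shows that the protein is actually produced, which is the point made on the slide.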
109
EMBL (DNA)
A similar pipeline is used at the NCBI to go from
GenBank to GenPept
110
!!!! The quality of UniProtKB/TrEMBL (and GenPept) entries depends upon the quality of the submissions in the original EMBL-Bank/GenBank/DDBJ entry.
113
EMBL (DNA)
114
Splice variants
Sequence
Sequence features
Ontologies
Annotations
References
Nomenclature
115
Evidence for protein existence: annotation in UniProtKB
5 levels of evidence:
1. evidence at protein level,
2. evidence at transcript level,
3. inferred by homology,
4. predicted,
5. uncertain.
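
The evidence level can be read programmatically from an entry. A minimal sketch, assuming the text (flat-file) format of a UniProtKB entry, in which the level is carried by a line starting with `PE` followed by the level number and a colon; the example entry text and the function name are hypothetical, and the labels are those of the slide.

```python
import re

# The five levels, worded as on the slide.
PROTEIN_EXISTENCE = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred by homology",
    4: "Predicted",
    5: "Uncertain",
}

# Assumed layout of the protein existence line: "PE   1: ...;"
PE_LINE = re.compile(r"^PE\s+([1-5]):")

def protein_existence_level(flat_file_text):
    """Return (level, label) from an entry's PE line, or None if absent."""
    for line in flat_file_text.splitlines():
        match = PE_LINE.match(line)
        if match:
            level = int(match.group(1))
            return level, PROTEIN_EXISTENCE[level]
    return None

# Made-up fragment of an entry, for illustration only.
toy_entry = "ID   TOY_HUMAN\nPE   2: Evidence at transcript level;\n"
print(protein_existence_level(toy_entry))
```

Filtering a protein set on this value is a quick way to separate sequences with experimental support from purely predicted ones.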
116
http://www.uniprot.org/uniprot/P35613
118
http://www.uniprot.org/uniprot/Q9Y471
119
http://www.uniprot.org/uniprot/Q9Y471
120
UniProtKB/Swiss-Prot: 115 explicit links and 19 implicit links!
  • Family and domain dbs: Gene3D, HAMAP, InterPro, PANTHER, Pfam, PIRSF, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs
  • Organism-specific dbs: AGD, BuruList, CGD, CTD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GenAtlas, GeneCards, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, Orphanet, PharmGKB, PhotoList, PseudoCAP, RGD, SagaList, SGD, SubtiList, TAIR, TubercuList, WormBase, WormPep, Xenbase, ZFIN
  • Genome annotation dbs: Ensembl, GeneID, GenomeReviews, KEGG, NMPDR, TIGR, UCSC, VectorBase
  • Sequence dbs: EMBL, IPI, PIR, UniGene, RefSeq
  • Proteomic dbs: PeptideAtlas, PRIDE, ProMEX
  • Phylogenomic dbs: HOGENOM, HOVERGEN, OMA
  • Gene expression dbs: ArrayExpress, Bgee, CleanEx, GermOnline
  • Polymorphism dbs: dbSNP
  • 2D-gel dbs: 2DBase-Ecoli, ANU-2DPAGE, Aarhus/Ghent-2DPAGE (no server), COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE (no server), HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE, World-2DPAGE
  • Ontologies: GO
  • Protein family/group dbs: CAZy, MEROPS, PeroxiBase, PptaseDB, REBASE, TCDB
  • 3D structure dbs: DisProt, HSSP, PDB, PDBsum, SMR
  • Enzyme and pathway dbs: BioCyc, BRENDA, Pathway_Interaction_DB, Reactome
  • PTM dbs: GlycoSuiteDB, PhosphoSite, PhosSite
  • Protein-protein interaction dbs: DIP, IntAct
  • Others: BindingDB, PMAP-CutDB, DrugBank, NextBio
122
The UniProt consortium
123
UniProt mission: provide a comprehensive, high-quality and freely accessible resource of protein sequence and functional annotation.
125
Update frequency: a crucial issue!!
  • Sometimes very difficult, or even impossible, to find
  • Crucial not only for the database itself, but also for tools using databases.

126
Update frequency
128
http://www.matrixscience.com/search_intro.html
129
The Mascot MS/MS identification tool is fine, but it cannot be used from this website!
Solution: download the database of interest and make sure you work with an up-to-date version.
130
Never hesitate to ask for an update
132
  • UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9232223 entries)
  • UniParc: protein sequence archive (equivalent to EMBL-Bank/GenBank/DDBJ at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20070606 entries)
133
UniParc entry contains all records for a unique
sequence in major publicly available databases.
134
  • UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (9232223 entries)
  • UniParc: protein sequence archive (EMBL equivalent at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (20070606 entries)
  • UniRef: 3 clusters of protein sequences with 100%, 90% and 50% similarity; useful to speed up sequence similarity search (BLAST) (query, Blast, download) (UniRef100: 8'474'689 entries; UniRef90: 5'668'669 entries; UniRef50: 2'729'565 entries)
135
UniRef100, 90 and 50
  • One UniRef100 entry -> merge of identical sequences (including subfragments, splice variants). Based on UniProtKB sequences and selected UniParc records (such as Ensembl, RefSeq).
  • One UniRef90 entry -> sequences that have at least 90% identity. Built from UniRef100.
  • One UniRef50 entry -> sequences that are at least 50% identical. Built from UniRef100.
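
The idea of clustering at decreasing identity thresholds can be sketched with a toy greedy procedure. This is not the UniRef pipeline: the identity measure below is a crude ungapped comparison over the shorter sequence, every level is clustered directly from the raw sequences, and the four peptides are invented for the example.

```python
def identity(a, b):
    """Fraction of identical positions over the shorter sequence (ungapped)."""
    length = min(len(a), len(b))
    if length == 0:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / length

def greedy_cluster(sequences, threshold):
    """Longest sequences become representatives; others join the first
    cluster whose representative they match at or above the threshold."""
    clusters = []  # list of (representative, members)
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters

# Invented peptides: a full-length sequence, a one-residue variant,
# a more divergent relative and a fragment of the first one.
toy_sequences = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTGGGGGQR", "MKTAY"]

for threshold in (1.0, 0.9, 0.5):
    print(threshold, len(greedy_cluster(toy_sequences, threshold)))
```

At 100% only the fragment merges with its parent sequence, at 90% the single-residue variant joins as well, and at 50% all four sequences fall into one cluster, which is why searching the lower-identity sets is faster.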

137
  • UniProtKB: protein sequence knowledgebase, 2 sections: UniProtKB/Swiss-Prot and UniProtKB/TrEMBL (query, Blast, download) (7097874 entries)
  • UniParc: protein sequence archive (EMBL equivalent at the protein level). Each entry contains a protein sequence with cross-links to other databases where you find the sequence (active or not). Not annotated. (query, no Blast on www.uniprot.org, Blast @ EBI, not downloadable) (17646564 entries)
  • UniRef: 3 clusters of protein sequences with 100%, 90% and 50% similarity; useful to speed up sequence similarity search (BLAST) (query, Blast, download) (UniRef100: 6'652'983 entries; UniRef90: 4'438'653 entries; UniRef50: 2'104'702 entries)
  • UniMES: protein sequences derived from metagenomic projects (Global Ocean Sampling (GOS)) (Blast, download) (UniMES: 6'028'191 entries)
138
What is "Non-Redundancy"?
  • UniParc
  • One UniParc entry for all entries corresponding to 100% identical sequences (100% identity over the entire length) (from many different databases).
  • UniRef
  • One UniRef100 entry for all entries corresponding to 100% identical sequences (including fragments) from UniProtKB, Ensembl, RefSeq, PDB.
  • UniProtKB/Swiss-Prot
  • One Swiss-Prot entry for all the protein products of one gene, including fragments, variations/polymorphisms, splice variants, sequencing errors
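
The UniParc notion of non-redundancy (one entry per sequence that is identical over its entire length) amounts to grouping records by their exact sequence. A minimal sketch: an MD5 digest is used here simply as a convenient grouping key and is an assumption of this example, and the source names, identifiers and sequences are all invented.

```python
import hashlib
from collections import defaultdict

def group_identical(records):
    """Group (source_db, identifier, sequence) records by exact sequence."""
    groups = defaultdict(list)
    for source_db, identifier, sequence in records:
        # Identical sequences give identical digests, whatever their source.
        key = hashlib.md5(sequence.upper().encode("ascii")).hexdigest()
        groups[key].append((source_db, identifier))
    return dict(groups)

# Invented records: the same sequence deposited in three resources,
# plus a fragment that is NOT merged because it differs in length.
toy_records = [
    ("ToyArchiveA", "A0001", "MKTAYIAKQR"),
    ("ToyArchiveB", "B0042", "MKTAYIAKQR"),
    ("ToyCurated", "C0007", "mktayiakqr"),
    ("ToyArchiveA", "A0002", "MKTAY"),
]

for key, members in group_identical(toy_records).items():
    print(key[:8], members)
```

The fragment stays in its own group here, whereas UniRef100 and Swiss-Prot would merge it with the full-length sequence: the three resources apply three different definitions of redundancy.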

139
Comparing searches NCBI and UniProt
140
Search for the human Toll-like receptor 4: Entrez Protein (NCBI)
141
Search for the human Toll-like receptor 4 in UniProtKB
Swiss-Prot
142
Sequences retrieved in Entrez Protein: O00206, AAF05316, CAH72618, CAH72619, BAG55035, AAI17423, AAF89753, NP_612564, AAC34135
Based on A126770, BC117422, AL160272 and AA598398
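
The identifiers returned by Entrez Protein come from different resources, which can be recognised from their format. A minimal sketch, assuming three common patterns: UniProtKB accessions, RefSeq accessions (two letters, underscore, digits) and INSDC protein identifiers (three letters followed by digits). The patterns are simplified and the function name is hypothetical.

```python
import re
from collections import Counter

PATTERNS = [
    ("UniProtKB accession", re.compile(
        r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
        r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$")),
    ("RefSeq accession", re.compile(r"^[A-Z]{2}_[0-9]+(?:\.[0-9]+)?$")),
    ("INSDC protein identifier", re.compile(r"^[A-Z]{3}[0-9]{5,7}(?:\.[0-9]+)?$")),
]

def classify_identifier(identifier):
    """Return the name of the first pattern that matches, else 'unknown'."""
    for name, pattern in PATTERNS:
        if pattern.match(identifier):
            return name
    return "unknown"

# Identifiers listed on the slide for human Toll-like receptor 4.
retrieved = ["O00206", "AAF05316", "CAH72618", "CAH72619", "BAG55035",
             "AAI17423", "AAF89753", "NP_612564", "AAC34135"]

print(Counter(classify_identifier(identifier) for identifier in retrieved))
```

One Swiss-Prot accession, one RefSeq record and seven translations of nucleotide entries describe the same protein, which illustrates the redundancy of a search engine that integrates several resources without merging them.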
143
Major protein sequence resources
UniProtKB: Swiss-Prot and TrEMBL (resources integrated in the entries: PIR, PDB, PRF)
Entrez Protein: Swiss-Prot, GenPept, PIR, PDB, PRF, RefSeq (resources integrated in the search engine)
  • UniProtKB/Swiss-Prot: manually annotated protein sequences (12000 species)
  • UniProtKB/TrEMBL: submitted CDS (EMBL), automated annotation (202000 species)
  • GenPept: submitted CDS (GenBank)
  • PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB
  • PDB: Protein Data Bank; 3D data and associated sequences
  • PRF: journal scan of published peptide sequences
  • RefSeq: Reference Sequence for DNA, RNA, protein; gene prediction; some manual annotation
144
Model Organism Databases (MODs) at a glance
145
Model organism: species extensively studied to understand particular biological phenomena, with the expectation that discoveries made in the model organism will provide insight into the workings of other organisms.
Model organisms and their MODs:
  • Mus musculus: MGI, http://www.informatics.jax.org/
  • Rattus norvegicus: RGD, http://rgd.mcw.edu/
  • Oryza sativa: RAP-DB, http://rapdb.dna.affrc.go.jp/
  • Arabidopsis thaliana: TAIR, http://www.arabidopsis.org/
  • Drosophila melanogaster: FlyBase, http://flybase.org/
  • Schizosaccharomyces pombe: S. pombe GeneDB, http://www.genedb.org/genedb/pombe/
  • Saccharomyces cerevisiae: SGD, http://www.yeastgenome.org/
  • Caenorhabditis elegans: WormBase, http://www.wormbase.org/
  • Dictyostelium discoideum: dictyBase, http://dictybase.org/
  • Bacillus subtilis: SubtiList, http://genolist.pasteur.fr/SubtiList/
  • Escherichia coli: EcoGene, http://ecogene.org/
  • Danio rerio (zebrafish): ZFIN, http://zfin.org/
Just a few examples, not an exhaustive list!
Methanocaldococcus jannaschii -> no MOD

146
Model organism databases (MODs)
  • Genome annotation
  • Gene models
  • Gene mapping
  • Official nomenclature
  • Gene expression
  • Functional annotation
  • Interactions
  • Information about mutants/knockout/transgenic animals
  • Phenotypes
  • (Cross-)references
  • Species-specific reagents
Key resources for information on a given organism
Service provided to/from a given community
153
http://gmod.org/wiki/Main_Page
154
The world of databases is a jungle
155
  • A few points to remember
  • when using databases
  • Content
  • - Primary / secondary / meta-databases
  • - Curated / non-curated
  • - manual / automated curation
  • - Redundant / non-redundant.
  • Update frequency
  • Stable identifiers
  • Strategy
  • Dataflow
  • Collaborations between databases.

156
Test a few genomic databases and tools
157
Genomes and genomic tools: a few sites
  • NCBI: http://www.ncbi.nlm.nih.gov/sites/entrez?db=genome
  • EBI: http://www.ebi.ac.uk/genomes/
  • TIGR: http://cmr.jcvi.org/tigr-scripts/CMR/shared/Genomes.cgi
Genome annotation and analysis tools
  • http://www.ensembl.org/index.html
  • http://vega.sanger.ac.uk/index.html
  • http://genome.ucsc.edu/ -> BLAT, Galaxy, Custom tracks, ...
  • http://www.jgi.doe.gov/software/ -> Genome portal, Integrated Microbial Genomes (IMG) and other tools
Generic Model Organism Database: http://gmod.org/wiki/Main_Page
158
Genomes and genomic tools: hands-on
  • Find your favorite (completely sequenced) organism in a genome db
  • Follow the links to see the options on different sites
  • Find the sequences
  • Look at the annotation of your favorite gene
  • Compare the entries corresponding to this gene across sites
  • Test search engines (restrict searches, compare results, ...)
  • Whenever possible use on-line tutorials, such as http://www.ensembl.org/info/website/tutorials/index.html
  • Visit GMOD, see the tools (http://gmod.org/wiki/GMOD_Components)
  • Play around with the BLAT search, customize display, follow the links, ...
159
Genomes and genomic tools: hands-on
  • Go and visit databases cited in this lecture
  • The databases/tools that should be "familiar" to all are:
  • http://genome.ucsc.edu/cgi-bin/hgBlat
  • http://www.ensembl.org/index.html
  • gene/genome databases/tools on http://www.ncbi.nlm.nih.gov/
  • If none of the databases are of interest to you, go to the NAR database collection (http://www.oxfordjournals.org/nar/database/a/) and find databases that are closest to your interests
  • Play around
Hands-on: protein sequence databases and UniProt
http://education.expasy.org/cours/HK09/Protein_database_TP.html
(corrections: http://education.expasy.org/cours/HK09/Protein_database_TP_correction.html)
160
Thank You !