Molecular Biology Databases - PowerPoint PPT Presentation

Slides: 82
Provided by: Mur2

1
Molecular Biology Databases
  • NCBI, DDBL, EMBL and others

2
What is a Database?
  • A database can be defined as "a collection of
    data arranged for ease and speed of search and
    retrieval."
  • A DNA database contains individual records or
    data entries of the DNA sequences as well as
    information about the sequences.
  • A DNA database often contains flat-files. These
    are relatively simple database systems in which
    each database is contained in a single table.
  • In contrast, relational database systems can use
    multiple tables to store information, and each
    table can have a different record format.
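The flat-file versus relational contrast above can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, column names and SEQ0001/SEQ0002 accessions are made up for the example and do not reflect how GenBank stores data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Flat-file style: a single table, organism name repeated on every record.
cur.execute("CREATE TABLE flat (accession TEXT, organism TEXT, length INTEGER)")
cur.executemany("INSERT INTO flat VALUES (?, ?, ?)", [
    ("SEQ0001", "Limulus polyphemus", 3808),
    ("SEQ0002", "Limulus polyphemus", 1200),
])

# Relational style: organisms live in their own table and are referenced by key.
cur.execute("CREATE TABLE organism (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE sequence (accession TEXT, organism_id INTEGER, length INTEGER)")
cur.execute("INSERT INTO organism VALUES (1, 'Limulus polyphemus')")
cur.executemany("INSERT INTO sequence VALUES (?, ?, ?)", [
    ("SEQ0001", 1, 3808),
    ("SEQ0002", 1, 1200),
])


def flat_lookup(name):
    """Rows for one organism from the single flat table."""
    cur.execute("SELECT accession, length FROM flat "
                "WHERE organism = ? ORDER BY accession", (name,))
    return cur.fetchall()


def relational_lookup(name):
    """The same rows, obtained by joining the two relational tables."""
    cur.execute("SELECT s.accession, s.length FROM sequence s "
                "JOIN organism o ON o.id = s.organism_id "
                "WHERE o.name = ? ORDER BY s.accession", (name,))
    return cur.fetchall()
```

Both lookups return the same rows; the relational layout simply stores the organism name once instead of on every record.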

3
GenBank as a Database
  • GenBank is the National Institutes of Health (NIH)
    genetic sequence database, an annotated
    collection of all publicly available DNA
    sequences.
  • It is maintained by the National Center for
    Biotechnology Information (NCBI) within the
    National Institutes of Health (NIH).

4
Anatomy of a Genome InfoSystem
  • Information structure
      • Records of hierarchical, complex documents;
        tables of rows and columns of numbers, letters,
        words
      • Table of contents, reports, indexing (as in a
        reference book)
      • Browse through the available structure
      • Search and retrieve according to biological
        questions
      • Bulk data selection and retrieval for other uses
  • Information content
      • Primary literature (referenced, abstracted
        and curated), sequence and feature analyses,
        maps, controlled vocabulary/ontologies relevant
        to biology, people, research methods, contacts,
        etc.
      • Metadata describing primary data, along with
        protocols, notes, sources
  • Informatics / software
      • Back-end: database, data collection,
        management, with some analyses
      • Front-end: information services (hypertext
        web, document search/retrieval methods), ease of
        understanding and usage (HCI)
      • Middleware: glue code, software, etc.
      • Specialized applications for genome data: maps,
        BLAST searches, ontologies

5
History of Sequence Databases
  • The first bioinformatics databases were
    constructed a few years after the first protein
    sequences began to become available.
  • The first protein sequence reported was that of
    bovine insulin in 1956, consisting of 51
    residues.
  • Nearly a decade later, the first nucleic acid
    sequence was reported, that of yeast alanine tRNA
    with 77 bases.
  • Just a year later, Dayhoff gathered all the
    available sequence data to create the first
    bioinformatic database.
  • The Protein DataBank followed in 1972 with a
    collection of ten X-ray crystallographic protein
    structures, and the SWISSPROT protein sequence
    database began in 1987.

6
GenBank History
  • DNA databases began in the early 1980s with a
    database called GenBank, which was originated by
    the U.S. Department of Energy to hold the short
    stretches of DNA sequence that scientists were
    just beginning to obtain from a range of
    organisms.
  • In the early days of GenBank, rooms of
    technicians sat at keyboards consisting of only
    the four letters A, C, T and G, tediously
    entering the DNA-sequence information published
    in academic journals.

7
The National Center for Biotechnology Information
  • Created as a part of NLM in 1988
  • Establish public databases
  • U.S. National DNA Sequence Database
  • Perform research in computational biology
  • Develop software tools for sequence analysis
  • Disseminate biomedical information

8
GenBank History
  • Newer communication technologies enabled
    researchers to dial up GenBank and dump in their
    sequence data directly.
  • The administration of GenBank was transferred to
    the National Institutes of Health's National Center
    for Biotechnology Information (NCBI).
  • With the advent of the World Wide Web,
    researchers could access the data in GenBank for
    free from around the globe.
  • Once the Human Genome Project (HGP) began in
    1990, DNA-sequence data in GenBank began to grow
    exponentially.
  • With the introduction in the 1990s of
    high-throughput sequencing, additions to GenBank
    skyrocketed.

9
  • An Interesting Metaphor
  • For Bioinformatics Information Flow and Databases
  • Cooks generate and enter the data.
  • Data Management makes it into a stew of blended
    information.
  • The waiters take the data from the servers to the
    public.
  • The diners are placing orders for the information
    they wish to consume.

11
Molecular Databases
  • Primary Databases
      • Original submissions by experimentalists
      • Database staff organize but don't add additional
        information
      • Examples: GenBank, SNP, GEO
  • Derivative Databases
      • Human curated
          • Compilation and correction of data
          • Examples: SWISS-PROT, NCBI RefSeq mRNA
      • Computationally derived
          • Example: UniGene
      • Combinations
          • Example: NCBI Genome Assembly

12
What, the scientists submit their own DNA
sequences?
  • Who checks for error?
  • Who makes people actually send their data to the
    database so all can share it?
  • Learn from the successes and failures of GenBank/EMBL:
    extensive publicly shared bio-data
  • Carrot/stick approach. Granting agencies and
    journals began requiring scientists to publish
    sequence data. Patented sequences must be
    entered in the databases too.
  • However, there is significant public databank
    error due to data ownership by scientists: there are no
    inducements to update or go back and correct
    errors.

14
GenBank is NCBI's Primary Sequence Database
  • Nucleotide only sequence database
  • Archival in nature
  • GenBank Data
  • Direct submissions (traditional records )
  • Batch submissions (EST, GSS, STS)
  • ftp accounts (genome data)
  • Three collaborating databases
  • GenBank
  • DNA Database of Japan (DDBJ)
  • European Molecular Biology Laboratory (EMBL)
    Database

15
Why use Bioinformatics Databases?
  • Speed of information retrieval
  • Increasing size of data sets
  • Amount of information available
  • Save time and money by simulating experiments
    prior to actual experiment (a.k.a. in silico)

16
How do you access Databases?
  • Search engines
  • Programs that allow you to search the database
  • Links from other sites to the search engines
  • Programs that directly link to the search engines

17
Boolean Logic
  • Why do we use Boolean operators?
  • To narrow your search
  • To get fewer superfluous results
  • What are the Boolean operators?
  • AND: looks for entries with both terms
  • OR: looks for entries with one term or the other
  • NOT (or BUTNOT): looks for entries with one term
    but not the other
  • * (Wildcard): looks for ALL entries that contain
    a word beginning with the term placed before the *
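The Boolean operators map directly onto set operations. The sketch below uses a tiny made-up collection of citation records (the record numbers and descriptors are illustrative only) and Python's fnmatch for the trailing-asterisk wildcard.

```python
from fnmatch import fnmatch

# Hypothetical citation records: record id -> set of descriptors.
records = {
    1: {"food", "allergy"},
    2: {"food"},
    3: {"allergy"},
    4: {"allergen", "pollen"},
}


def having(term):
    """IDs of records with a descriptor matching term (a trailing * is a wildcard)."""
    return {rid for rid, words in records.items()
            if any(fnmatch(word, term) for word in words)}


food_and_allergy = having("food") & having("allergy")   # AND: both terms
food_or_allergy = having("food") | having("allergy")    # OR: either term
allergy_not_food = having("allergy") - having("food")   # NOT: first but not second
allerg_wildcard = having("allerg*")                      # wildcard: allergy, allergen, ...
```

AND gives the smallest set here and OR the largest, which matches the idea that AND narrows a search while OR widens it.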

18
AND
Allergy
Food
Citations that contain the descriptors Food AND
Allergy only.
19
OR
Citations that contain the descriptors Food OR
Allergy. This is a bigger set.
20
NOT
Citations that contain the descriptors Allergy
NOT Food
21
* (Wildcard)
Allerg*
Food
Citations that contain the descriptors Allerg*
(Allergies, Allergy, Allergen)
22
GenBank as a Database
  • GenBank identifiers are unique combinations of
    numbers and letters used to index GenBank
    sequence entries.
  • They can be used to retrieve information about a
    particular gene or DNA sequence from the GenBank
    database.
  • This information also includes links to similar
    sequence entries and other public databases, so
    GenBank behaves like a relational database as
    well as a flat-file database.

23
What is GenBank? NCBI's Primary Sequence Database
  • Nucleotide only sequence database
  • Archival in nature
  • GenBank Data
  • Direct submissions: individual records (BankIt,
    Sequin)
  • Batch submissions via email (EST, GSS, STS)
  • ftp accounts: sequencing centers
  • Data shared among three collaborating databases
  • GenBank
  • DNA Database of Japan (DDBJ)
  • European Molecular Biology Laboratory Database
    (EMBL) at EBI

24
The International Sequence Database Collaboration
25
GenBank: NCBI's Primary Sequence Database
83.65 Gigabytes of data
26
GenBank: NCBI's Primary Sequence Database
114 Gigabytes
27
GenBank: NCBI's Primary Sequence Database
  • full release every two months
  • incremental and cumulative updates daily
  • available only through internet

ftp://ftp.ncbi.nih.gov/genbank/
28
The Growth of GenBank
Release 139: 31.0 million records, 36.6 billion nucleotides
Average doubling time: 12 months
[Chart axes: Sequence Records (millions), Total Base Pairs (billions)]
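To get a feel for what a 12-month doubling time means, here is a small projection sketch. It assumes the doubling time stays constant, which is an illustrative assumption and not a claim from the slide; the function name is made up for the example.

```python
def projected_size(current, months_ahead, doubling_months=12.0):
    """Size after months_ahead months, assuming a constant doubling time."""
    return current * 2 ** (months_ahead / doubling_months)


# Starting from the 36.6 billion nucleotides quoted for Release 139:
one_year = projected_size(36.6, 12)    # one doubling
two_years = projected_size(36.6, 24)   # two doublings
```

Under this assumption the database doubles once per year, so two years of growth multiplies the size by four.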
29
The Entrez System
30
Entrez Nucleotides
  • Primary
      • GenBank / EMBL / DDBJ     35,116,960
  • Derivative
      • RefSeq                       259,219
      • Third Party Annotation         3,182
      • PDB                            4,703
  • Total                         35,384,248

31
Entrez Protein
  • GenPept (GB, EMBL, DDBJ)   3,442,298
  • RefSeq                       856,191
  • Third Party Annotation         3,834
  • Swiss Prot                   144,508
  • PIR                          282,821
  • PRF                           12,079
  • Total                      3,442,298
  • BLAST nr                   1,642,191

32
Organization of GenBank: GenBank Divisions
  • Records are divided into 17 Divisions.
  • 1 Patent (11 files)
  • 5 High Throughput
  • 11 Traditional

EST (288)  Expressed Sequence Tag
GSS (98)   Genome Survey Sequence
HTG (61)   High Throughput Genomic
STS (3)    Sequence Tagged Site
HTC (3)    High Throughput cDNA

PRI (27)   Primate
PLN (10)   Plant and Fungal
BCT (8)    Bacterial and Archaeal
INV (6)    Invertebrate
ROD (11)   Rodent
VRL (3)    Viral
VRT (4)    Other Vertebrate
MAM (1)    Mammalian (ex. ROD and PRI)
PHG (1)    Phage
SYN (1)    Synthetic (cloning vectors)
UNA (1)    Unannotated

  • Traditional Divisions
      • Direct Submissions (Sequin and BankIt)
      • Accurate
      • Well characterized
  • BULK Divisions
      • Batch Submission (Email and FTP)
      • Inaccurate
      • Poorly characterized

Entrez query: gbdiv_xxx[Properties]
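A division can be used to restrict an Entrez search. The helper below is a hypothetical sketch that only builds the query string; it assumes the division filter is written as gbdiv_xxx followed by a [Properties] field tag (the brackets are an assumption, since they do not survive in this transcript), and it does not contact NCBI.

```python
# Division codes listed on the slide, plus PAT for the patent division.
DIVISIONS = {
    "est", "gss", "htg", "sts", "htc",
    "pri", "pln", "bct", "inv", "rod", "vrl", "vrt",
    "mam", "phg", "syn", "una", "pat",
}


def division_query(term, division):
    """Combine a search term with a GenBank division filter (query text only)."""
    code = division.lower()
    if code not in DIVISIONS:
        raise ValueError(f"unknown GenBank division: {division!r}")
    # Assumed field-tag syntax: gbdiv_<code>[Properties]
    return f"{term} AND gbdiv_{code}[Properties]"


query = division_query("myosin", "INV")
```

The resulting string could be pasted into the Entrez search box; validating the division code first avoids silently searching with a misspelled filter.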
33
Traditional GenBank Divisions
  • Direct Submissions (Sequin and BankIt)
  • Accurate
  • Well characterized

BCT  Bacterial and Archaeal
INV  Invertebrate
MAM  Mammalian (ex. ROD and PRI)
PHG  Phage
PLN  Plant and Fungal
PRI  Primate
ROD  Rodent
SYN  Synthetic (cloning vectors)
VRL  Viral
VRT  Other Vertebrate
34
A Helpful Resource
  • This is a link to a sample annotated GenBank
    Record. Click on any of the underlined links to
    learn more about the file structure.
  • http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html

35
What is an Accession Number?
  • An accession number is a label used to
    identify a sequence in the various databases. It
    is a string of letters and/or numbers that
    corresponds to a molecular sequence.
  • Examples (all for retinol-binding protein, RBP4):
  • X02775: GenBank genomic DNA sequence
  • NT_030059: Genomic contig
  • rs7079946: dbSNP (single nucleotide polymorphism)
  • N91759.1: An expressed sequence tag (1 of 170)
  • NM_006744: RefSeq DNA sequence (from a transcript)
  • NP_007635: RefSeq protein
  • AAC02945: GenBank protein
  • Q28369: SwissProt protein
  • 1KT7: Protein Data Bank structure record

36
GenBank Flat File Format
  • When you click on an entry, you have opened a
    GenBank Flat File
  • Information includes
  • The Name of the gene
  • The Accession number
  • Journal articles

37
GenBank Flat File Format
  • Information (Cont)
  • Structural information of the gene (e.g.
    intron/exon boundaries, promoters, etc.)
  • The code for the protein
  • The code for the DNA (RNA-if mRNA it is the cDNA
    for the mRNA sequenced)

38
A Traditional GenBank Record
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.

Slide callouts: Definition = Title; NCBI's Taxonomy
39
GenBank Record Feature Table
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//
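Because the flat file is keyword-driven text, its header can be read with very little code. The following is a minimal sketch, not NCBI's parser: it assumes keyword lines start within the first ten columns, continuation lines are indented further, and "//" ends the record. The SAMPLE text is an abbreviated copy of the record above, and parse_header is a name made up for the example.

```python
SAMPLE = """\
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata.
//
"""


def parse_header(text):
    """Map each keyword (LOCUS, ORGANISM, ...) to its value text."""
    fields = {}
    key = None
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-record marker
            break
        if not line.strip():
            continue
        if line[:10].strip() == "":        # continuation of the previous keyword
            if key is not None:
                fields[key] += " " + line.strip()
            continue
        parts = line.split(None, 1)        # keyword, then the rest of the line
        key = parts[0]
        fields[key] = parts[1].strip() if len(parts) > 1 else ""
    return fields


header = parse_header(SAMPLE)
```

For real work an established library is a better choice; this sketch only shows why the format is called a flat file.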
40
Multiple Formats are available for Sequence Data
  • Historically, all the DNA and Protein software
    was written concurrent with the establishment of
    the databases. So the formats needed in the
    databases and the software co-evolved.
  • Sequence analysis software needs simpler formats
    than databases do, for speed; otherwise the program
    must be allowed to ignore most of the excess
    information.

41
FastA format is a very popular solution
>gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd
CCACCAGATATAATTAAGTAGATC
AGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGA GAT
AGCATAGGTACTGAATCTCCTGTAGCAAGAGAGGTACTTGAAACTGGCAC
ACTCTCTATTGTTGTGC TTGGTGCTTCTGGTGATCTTGCCAAGAAGAAG
ACTTTTCCTGCACTTTTTCACTTATATAAACAGGAATT GTTGCCACCTG
ATGAAGTTCACATTTTTGGCTATGCAAGGTCAAAGATCTCCGATGATGAA
TTGAGAAAC AAATTGCGTAGCTATCTTGTTCCAGAGAAAGGTGCTTCTC
CTAAACAGTTAGATGATGTATCAAAGTTTT TACAATTGGTTAAATATGT
AAGTGGCCCTTATGATTCTGAAGATGGATTTCGCTTGTTGGATAAAGAGA
T TTCAGAGCATGAATATTTGAAAAATAGTAAAGAGGGTTCATCTCGGAG
GCTTTTCTATCTTGCACTTCCT CCTTCAGTGTATCCATCCGTTTGCAAG
ATGATCAAAACTTGTTGCATGAATAAATCTGATCTTGGTGGAT GGACAC
GCGTTGTTGTTGAGAAACCCTTTGGTAGGGATCTAGAATCTGCAGAAGAA
CTCAGTACTCAGAT TGGAGAGTTATTTGAAGAACCACAGATTTATCGTA
TTGATCACTATTTAGGAAAGGAACTAGTGCAAAAC ATGTTAGTACTTCG
TTTTGCAAATCGGTTCTTCTTGCCTCTGTGGAACCACAACCACATTGACA
ATGTGC AGATAGTATTTAGAGAGGATTTTGGAACTGATGGTCGTGGTGG
ATATTTTGACCAATATGGAATTATCCG AGATATCATTCCAAACCATCTG
TTGCAGGTTCTTTGCTTGATTGCTATGGAAAAACCCGTTTCTCTCAAG C
CTGAGCACATTCGAGATGAGAAAGTGAAGGTTCTTGAATCAGTACTCCCT
ATTAGAGATGATGAAGTTG TTCTTGGACAATATGAAGGCTATACAGATG
ACCCAACTGTACCGGACGATTCAAACACCCCGACTTTTGC AACTACTAT
TCTGCGGATACACAATGAAAGATGGGAAGGTGTTCCTTTCATTGTGAAAG
CAGGGAAGGCC CTAAATTCTAGGAAGGCAGAGATTCGGGTTCAATTCAA
GGATGTTCCTGGTGACATTTTCAGGAGTAAAA AGCAAGGGAGAAACGAG
TTTGTTATCCGCCTACAACCTTCAGAAGCTATTTACATGAAGCTTACGGT
CAA GCAACCTGGACTGGAAATGTCTGCAGTTCAAAGTGAACTAGACTTG
TCATATGGGCAACGATATCAAGGG ATAACCATTCCAGAGGCTTATGAGC
GTCTAATTCTCGACACAATTAGAGGTGATCAACAACATTTTGTTC GCAG
AGACGAATTAAAGGCATCATGGCAAATATTCACACCACTTTTACACAAAA
TTGATAGAGGGGAGTT GAAGCCGGTTCCTTACAACCCGGGAAGTAGAGG
TCCTGCAGAAGCAGATGAGTTATTAGAAAAAGCTGGA TATGTTCAAACA
CCCGGTTATATATGGATTCCTCCTACCTTATAGAGTGACCAAATTTCATA
ATAAAACA AGGATTAGGATTATCAGGAGCTTATAAATAAGTCTTCAATA
AGCTTGTGAAATTTTCGTTATAATCTCTC TCATTTTGGGGTGTATATCA
AGCATTTAAGCGCGTGTTTGACACAGTTTGTGTAATAGATTTGGCTCTGA
ATGAAAATAAACGGGAATTGTTTCTTTTTGTTTTA
42
FASTA format
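FASTA is simple enough to parse by hand: a header line starting with ">" followed by one or more lines of sequence. The sketch below is a minimal reader written for illustration; the function name and the two short example records are made up.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>seq1 hypothetical example record
CCACCAGATA
TAATTAAG
>seq2
ACGT
"""
parsed = parse_fasta(example)
```

Sequence lines are joined without separators, so the line width used in the file does not affect the result.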
43
Graphics format
44
ASN.1 Format
  • ASN.1, or Abstract Syntax Notation One, is an
    International Organization for Standardization
    (ISO) data representation format used to achieve
    interoperability between platforms.
  • NCBI uses ASN.1 for the storage and retrieval of
    data such as nucleotide and protein sequences,
    structures, genomes, and MEDLINE records.
  • ASN.1 permits computers and software systems of
    all types to reliably exchange both the data
    structure and content.

45
NCBI Software Development Tool Kit
  • The "NCBI Toolbox" is a set of software and data
    exchange specifications used by NCBI to produce
    portable, modular software for molecular biology.
  • The software in the Toolbox is primarily designed
    to read ASN.1 format records.
  • It is available to the public in the
    toolbox/ncbi_tools directory of NCBI's ftp site,
    and can be used in its own right or as a
    foundation for building tools with similar
    properties.
  • The readme files in the toolbox and
    toolbox/ncbi_tools directories of the FTP site
    contain more information about the toolbox and
    ASN.1.

46
Abstract Syntax Notation ASN.1
Seq-entry ::= set {
  level 1 ,
  class nuc-prot ,
  descr {
    title "Medicago sativa glucose-6-phosphate dehydrogenase mRNA, and translated products" ,
    source {
      org {
        taxname "Medicago sativa subsp. sativa" ,
        db {
          {
            db "taxon" ,
            tag id 56147 } } ,
        orgname {
          name binomial {
            genus "Medicago" ,
            species "sativa" ,
            subspecies "subsp. sativa" } ,
          mod {

47
NCBI Toolbox
/*
 * asn2ff.c
 * convert an ASN.1 entry to flat file format,
 * using the FFPrintArray.
 */
#include <accentr.h>
#include "asn2ff.h"
#include "asn2ffp.h"
#include "ffprint.h"
#include <subutil.h>
#include <objall.h>
#include <objcode.h>
#include <lsqfetch.h>
#include <explore.h>

#ifdef ENABLE_ID1
#include <accid1.h>
#endif

FILE *fpl;

Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},
48
Database Tools aren't keeping pace
  • Despite the huge progress in sequencing and
    expression analysis technologies, and the
    corresponding orders-of-magnitude growth in the
    data held in public, private and commercial databases,
    the tools used for storage, retrieval, analysis
    and dissemination of data in bioinformatics are
    still very similar to the original systems
    gathered together by researchers 15-20 years ago.
  • Many are simple extensions of the original
    academic systems, which have served the needs of
    both academic and commercial users for many
    years.
  • These systems are now beginning to fall behind
    as they struggle to keep up with the pace of
    change in the pharma industry.

49
Database Tools aren't keeping pace
  • Databases are still gathered, organized,
    disseminated and searched using flat files.
  • Relational databases are still few and far
    between, and object-relational or fully object
    oriented systems are rarer still in mainstream
    applications.
  • Interfaces still rely on command lines, fat
    client interfaces, which must be installed on
    every desktop, or HTML/CGI forms.
  • While the tools were in the hands of bioinformatics
    specialists, pharma companies were relatively
    undemanding of them.
  • Now that the problems have expanded to cover the
    mainstream discovery process, much more flexible
    and scalable solutions are needed to serve pharma
    R&D informatics requirements.

50
There is more than one type of DNA sequence in
GenBank
  • Genomic sequences, made from genomic DNA: these do
    contain introns and LOTS of DNA that never
    becomes messenger RNA. mRNA codes for proteins.
  • cDNA sequences, made from mRNA: these don't
    contain the introns.
  • ESTs: short stretches of cDNA sequence that are
    sort of a rough draft.
  • mtDNA: from mitochondrial genomes.
  • SNPs: single nucleotide polymorphisms, positions
    with some DNA variation.

51
Quality of the Sequence is Variable
  • Some of the DNA is sequenced several times before
    it is added to the databases.
  • Some of the DNA is sequenced very quickly on
    automated equipment and is input directly from
    the computers.
  • Both are important types of information.
  • The draft is corrected by curators who assemble
    the pieces into the genome.

52
Genome Sequencing
Whole BAC insert (or genome)
shredding
sequencing
cloning isolating
GSS division or trace archive
assembly
Draft Sequence (HTG division)
53
Working Draft Sequence
54
Assembly Required.
  • All the data is still in the pieces used to
    assemble the genomes.
  • That means all the overlapping pieces are
    still in the databases.
  • So a search comes up with many versions and
    shorter subclone pieces. These are used to
    assemble the genomic contigs (contiguous
    pieces), which in turn are assembled into whole
    chromosomes.
  • Sometimes you want to use the smaller pieces,
    since handling the whole chromosome is awkward in
    sequence analysis.

55
HTG Division: High Throughput Genome
40,000 to > 350,000 bp
56
HTG Division: High Throughput Genome
40,000 to > 350,000 bp
57
Whole Genome Shotgun
58
STS Division: Sequence Tagged Sites
  • Segment of gene, EST, mRNA or genomic DNA
    of known position (microsatellite)
  • PCR with STS primers gives one product per genome
  • Basis of Radiation Hybrid Mapping
  • UniGene
  • Genome Assembly
  • Related resource: Electronic PCR
  • http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi
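The idea behind electronic PCR can be sketched in a few lines: look for the forward primer and, downstream of it, the reverse complement of the reverse primer, then report the product size. This toy version is an assumption-laden illustration (exact matches only, one strand only, made-up sequences) and is not how NCBI's e-PCR tool is implemented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def pcr_products(template, forward, reverse):
    """Return (start, length) for each predicted product on the given strand."""
    template = template.upper()
    fwd = forward.upper()
    rev_site = reverse_complement(reverse.upper())  # where the reverse primer anneals
    products = []
    start = template.find(fwd)
    while start != -1:
        end = template.find(rev_site, start + len(fwd))
        if end != -1:
            products.append((start, end + len(rev_site) - start))
        start = template.find(fwd, start + 1)
    return products


# Hypothetical template and primer pair.
template = "TTTACGTACCCCCCGGATTCAAA"
hits = pcr_products(template, "ACGTAC", "GAATCC")
```

A primer pair that yields exactly one product, as here, behaves like an STS for this template; more than one hit would mean the marker is not unique.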

59
Be aware of errors in the databases
  • Sequence errors
  • Genome projects: error rate is 1/10,000
    nucleotides.
  • ESTs: error rate is 1/100 nucleotides.
  • Annotation errors
  • Many databases annotate their sequences using
    automated computer programs. These programs do
    not always give correct annotations.
  • SwissProt is a protein database curated and
    annotated manually by biologists. It's regarded
    as the most reliable database, but does not have
    the most up-to-date sequence information.
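These error rates translate into expected error counts per sequence. The sketch below assumes errors are independent and uniformly likely at every base, which is a simplification; the 500-base EST length is an illustrative assumption, while 3808 bp is the length of the myosin III record shown earlier.

```python
def expected_errors(length_bp, error_rate):
    """Expected number of erroneous bases, assuming independent errors."""
    return length_bp * error_rate


def prob_at_least_one_error(length_bp, error_rate):
    """Probability that a sequence contains one or more errors."""
    return 1.0 - (1.0 - error_rate) ** length_bp


est_errors = expected_errors(500, 1 / 100)         # hypothetical 500-base EST
genome_errors = expected_errors(3808, 1 / 10_000)  # 3808 bp at the genome-project rate
est_risk = prob_at_least_one_error(500, 1 / 100)
```

Under these assumptions a single EST read of that length is expected to carry several errors, while a finished sequence of a few kilobases will usually carry none.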

60
There is a Lot of Sequence in the Databases
  • One problem is finding what you are looking for
    in the database.
  • Try putting the search term "human beta
    hemoglobin" into the nucleotide database. It
    won't be easy to find the sequence in the 88
    pages of hits!
  • RefSeq was invented to help you find some of the
    common sequences based on a human (or now, a
    computer) looking over all the similar
    submissions of the same sequence to the database.
  • RefSeq corrects some of those sequence errors by
    comparing lots of sequences.

61
RefSeq: NCBI's Derivative Sequence Database
  • Curated transcripts and proteins
  • reviewed
  • human, mouse, rat, fruit fly, zebrafish,
    arabidopsis, C. elegans
  • Human model transcripts and proteins
  • Assembled Genomic Regions (contigs)
  • draft human genome
  • mouse genome
  • Chromosome records
  • microbial
  • organelle

62
RefSeq Benefits
  • non-redundancy  
  • explicitly linked nucleotide and protein
    sequences
  • updates to reflect current sequence data and
    biology
  • data validation
  • format consistency
  • distinct accession series
  • stewardship by NCBI staff and collaborators

63
The RefSeq Accession Numbers
NCBI Reference Sequences

mRNAs and Proteins
NM_123456  Curated mRNA
NP_123456  Curated Protein
NR_123456  Curated non-coding RNA
XM_123456  Predicted Transcript (human, mouse)
XP_123456  Predicted Protein (human, mouse)
XR_123456  Predicted non-coding RNA

Gene Records
NG_123456  Reference Genomic Sequence (human)

Assemblies
NT_123456  Contig (Mouse and Human)
NW_123456  WGS Supercontig (Mouse)
NC_123455  Chromosome (Microbial, Arabidopsis)
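Since the two-letter prefix encodes the record type, a lookup table is enough to interpret a RefSeq accession. The sketch below uses only the prefixes listed on this slide; the function name and the accepted pattern (two letters, underscore, digits, optional version) are assumptions made for the example.

```python
import re

REFSEQ_PREFIXES = {
    "NM": "curated mRNA",
    "NP": "curated protein",
    "NR": "curated non-coding RNA",
    "XM": "predicted transcript",
    "XP": "predicted protein",
    "XR": "predicted non-coding RNA",
    "NG": "reference genomic sequence",
    "NT": "contig",
    "NW": "WGS supercontig",
    "NC": "chromosome",
}


def refseq_type(accession):
    """Return the record type for a RefSeq-style accession, or None."""
    match = re.fullmatch(r"([A-Z]{2})_\d+(\.\d+)?", accession.strip().upper())
    if match is None:
        return None
    return REFSEQ_PREFIXES.get(match.group(1))
```

For example, refseq_type("NM_006744") reports a curated mRNA, while a GenBank accession such as X02775 has no underscore prefix and returns None.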
64
GenBank Sequences Human Lipoprotein Lipase
65
Curated RefSeq Records NM_, NP_
66
Alignment Based Models
67
Alignment Based Models
68
Alignment Generated Transcripts: XM_, XP_
69
RefSeq Contig NT_, NW_
70
RefSeq Chromosomes NC_
LOCUS       NC_002695  5498450 bp    DNA   circular  BCT       02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_002695
VERSION     NC_002695.1  GI:15829254
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision;
            Enterobacteriaceae; Escherichia.
REFERENCE   1  (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
            Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T.,
            Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T.,
            Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai carrying
            the verotoxin 2 genes of the enterohemorrhagic Escherichia coli
            O157:H7 derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), 227-239 (1999)
  MEDLINE   20198780
   PUBMED   10734605
71
Integrated WWW Access BLAST and Entrez
72
Some Web Statistics
July 2001
73
Users per day
[Chart: yearly values for 1997, 1998, 1999, 2000, 2001]
74
Bulk GenBank Divisions
  • Batch Submission and htg (email and ftp)
  • Inaccurate
  • Poorly Characterized

EST  Expressed Sequence Tag
STS  Sequence Tagged Site
GSS  Genome Survey Sequence
HTG  High Throughput Genomic
75
EST Division Expressed Sequence Tags
>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGA
TGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGC
C TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCC
AGCAGAGAATGGAAAGTCAAAT TTCCTGAATTGCTATGTGTCTGGGTTT
CATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGA GAATTG
AAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTAT
CTCTTGTACTACAC TGAATTCACCCCCACTGAAAAAGATGAGTATGCCT
GCCGTGTTGAACCATGTNGACTTTGTCACAGNCCC AAGTTNAGTTTAAG
TGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGC
CGCNTT TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTT
TAATATTGGATATGCTTTTG
nucleus 30,000 genes
gatccantgccatacg
ctcgccaattcnntcg
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTT
ATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
CT TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGT
TTCATTCATTATAACAAATTTCC AATAATCCTGTCAATNATATTTCTAA
ATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAA CTTAT
GCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTC
AAATCTGACCAAGAT GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCA
CCTCTANGTTGCCAGCCCTC
  • isolate unique clones
  • sequence once
  • from each end

RNA gene products
76
UniGene
  • A gene-oriented view of sequence entries
  • UniGene collects expressed sequence tags (ESTs)
    into clusters, in an attempt to form one gene per
    cluster.
  • Use UniGene to study where your gene is expressed
    in the body, when it is expressed, and see its
    abundance.

77
UniGene
  • MegaBlast based automated sequence clustering
  • Nonredundant set of gene oriented clusters
  • Each cluster a unique gene
  • Information on tissue types and map locations
  • Includes well-characterized genes and novel ESTs
  • Useful for gene discovery and selection of
    mapping reagents

http://www.ncbi.nlm.nih.gov/UniGene/
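Clustering of this kind can be pictured as grouping sequences that are linked by similarity hits. The sketch below is a toy single-linkage clustering using union-find; the sequence names and the list of similar pairs are made up, and the real UniGene pipeline (MegaBlast comparisons plus additional rules) is considerably more involved.

```python
def cluster(ids, similar_pairs):
    """Group ids into clusters; ids linked by any chain of pairs share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in similar_pairs:
        parent[find(a)] = find(b)   # merge the two clusters

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical sequences and pairwise similarity hits.
names = ["est1", "est2", "est3", "est4", "mrna1"]
pairs = [("est1", "mrna1"), ("est2", "mrna1"), ("est3", "est4")]
clusters = cluster(names, pairs)
```

Here est1 and est2 never hit each other directly but end up in one cluster because both hit mrna1, which is the sense in which a cluster approximates one gene.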
78
EST hits A.t. serine protease mRNA
A.t. mRNA
5 EST hits
3 EST hits
79
Arabidopsis UniGene Statistics
UniGene Build 14, Apr. 9th, 2002

   39,855  mRNAs, gene CDSs
   87,006  EST, 3' reads
   42,137  EST, 5' reads
   32,571  EST, other/unknown
----------
  201,569  total sequences in clusters

Final Number of Clusters (sets)
   26,808  sets total
   25,474  sets contain at least one known gene
   17,654  sets contain at least one EST
   16,326  sets contain both genes and ESTs

115,000,000 bp
25,498 expected genes
5 uncharacterized transcripts
80
Hs UniGene Statistics
UniGene Build 148, Apr. 8th, 2002

   73,419  mRNAs, gene CDSs
1,181,855  EST, 3' reads
1,461,928  EST, 5' reads
  616,609  EST, other/unknown
----------
3,333,811  total sequences in clusters

Final Number of Clusters (sets)
   98,816  sets total
   22,431  sets contain at least one known gene
   97,618  sets contain at least one EST
   21,233  sets contain both genes and ESTs

3,000,000 base pairs
30 K expected genes
80 uncharacterized transcripts
81
UniGene Collections Jul, 2002
Species                Common name    Sequences   Clusters
Homo sapiens           human          3,569,546    101,602
Mus musculus           mouse          2,332,864     84,247
Rattus norvegicus      rat              334,582     62,220
Danio rerio            zebrafish        197,266     15,404
Bos taurus             cow              128,914     10,295
Xenopus laevis         frog             162,269     18,984
D. melanogaster        fruit fly        250,655     11,115
Anopheles gambiae      mosquito          43,126      2,556
Plants
Arabidopsis thaliana   thale cress      210,693     26,875
Oryza sativa           rice              78,632     15,802
Triticum aestivum      wheat            139,447     12,575
Hordeum vulgare        barley           160,518      7,324
Zea mays               maize (corn)     131,668     10,301