Databases at NCBI presentation

Title: Databases at NCBI

1
Databases at NCBI

Shiau, Cheng-Kai

2
Database

A database is a structured collection of records
or data that is stored in a computer system. The
structure is achieved by organizing the data
according to a database model. The model in most
common use today is the relational model. Other
models such as the hierarchical model and the
network model use a more explicit representation
of relationships.
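
As a minimal sketch of the relational model, the following uses Python's built-in sqlite3 module. The table name, columns, and rows are made up for illustration and do not come from any NCBI database.

```python
import sqlite3

# In-memory relational database; table and rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (name TEXT PRIMARY KEY, height_cm INTEGER, weight_kg INTEGER)"
)
conn.executemany(
    "INSERT INTO person VALUES (?, ?, ?)",
    [("A", 160, 50), ("B", 170, 60)],
)

# A relational query: select records by a condition on one column.
tall = conn.execute(
    "SELECT name FROM person WHERE height_cm > ? ORDER BY name", (165,)
).fetchall()
print(tall)
```

Each row is one record, and the table definition is the structure that the database model imposes on the data.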

http://en.wikipedia.org/wiki/Database
3
Database
4
Database
5
Database
6
Database
7
Database
8
Database
AmiGO
9
Database
AmiGO
10
Database
11
Database
AmiGO
12
Database

A database is a structured collection of records
or data that is stored in a computer system. The
structure is achieved by organizing the data
according to a database model. The model in most
common use today is the relational model. Other
models such as the hierarchical model and the
network model use a more explicit representation
of relationships.

http://en.wikipedia.org/wiki/Database
13
About NCBI

What does NCBI do?
Established in 1988 as a national resource for
molecular biology information, NCBI creates
public databases, conducts research in
computational biology, develops software tools
for analyzing genome data, and disseminates
biomedical information - all for the better
understanding of molecular processes affecting
human health and disease.

http://www.ncbi.nlm.nih.gov/
14
About NCBI
15
Databases at NCBI

Databases at NCBI
Literature databases
PubMed, PubMed Central, Books, OMIM
Molecular databases
Sequences
EST, STS, GSS, HTGS, HTC, FLIC, UniGene, RefSeq,
HomoloGene
Structures
MMDB, CDD
Taxonomy
Other databases
GEO, SKY/CGH

16
Databases at NCBI
http://www.ncbi.nlm.nih.gov/
17
Databases at NCBI

Databases at NCBI
Literature databases
PubMed, PubMed Central, Books, OMIM
Molecular databases
Sequences
EST, STS, GSS, HTGS, HTC, FLIC, UniGene, RefSeq,
HomoloGene
Structures
MMDB, CDD
Taxonomy
Other databases
GEO, SKY/CGH

18
Literature Databases

Literature databases
PubMed
PubMed Central
Books
OMIM

19
PubMed

PubMed
The PubMed database was designed to provide access to
citations (with abstracts) from biomedical
journals.
Subsequently, a linking feature was added to
provide access to full-text journal articles at
web sites of participating publishers, as well as
to other related web resources.

20
PubMed

Data sources
MEDLINE
NLM's premier bibliographic database, covering
the fields of medicine, nursing, dentistry,
veterinary medicine, the health care system, and
the preclinical sciences, such as molecular
biology.
Non-MEDLINE
Out-of-scope articles from general science and
chemistry journals whose life sciences articles
are indexed for MEDLINE, e.g., the plate
tectonics or astrophysics articles from Science
magazine.
Other databases
HealthSTAR, AIDSLINE, HISTLINE, SPACELINE,
BIOETHICSLINE, and POPLINE.

21
PubMed

All electronic data are supplied via FTP to NCBI
in XML format, in accordance with the NLM's
specifications (document type definition, or
DTD).
XML: extensible markup language
DTD: document type definition
Example:
A: 160 cm, 50 kg; B: 170 cm, 60 kg
<N>A</N><H>160</H><W>50</W><N>B</N><H>170</H><W>60</W>
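
The name/height/weight example above can be read back with a standard XML parser. This is a sketch only: the `<people>` wrapper element is an assumption added here, because the slide's fragment has no single root element, and it is not part of any NLM DTD.

```python
import xml.etree.ElementTree as ET

# The slide's fragment, wrapped in a hypothetical <people> root so that
# it becomes well-formed XML.
fragment = "<N>A</N><H>160</H><W>50</W><N>B</N><H>170</H><W>60</W>"
root = ET.fromstring(f"<people>{fragment}</people>")

# Group the flat N/H/W elements into one record per person.
children = list(root)
records = [
    {child.tag: child.text for child in children[i:i + 3]}
    for i in range(0, len(children), 3)
]
print(records)
```

A DTD would go one step further and state which tags are allowed and in what order, so that a receiver can validate a submission before loading it.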

22
PubMed

PubMed citations are indexed by MeSH (Medical
Subject Headings) terms.

NCBI Handbook
23
PubMed Central

PubMed Central (PMC) is the National Library of
Medicine's digital archive of full-text journal
literature.
Journals deposit material in PMC on a voluntary
basis.
Articles in PMC may be retrieved either by
browsing a table of contents for a specific
journal or by searching the database.
Certain journals allow the full text of their
articles to be viewed directly in PMC.
Other journals require that PMC direct users to
the journal's own web site to see the full text
of an article. In this case, the material will
always be available free to any user no more than
1 year after publication but will usually be
available only to the journal's subscribers for
the first 6 months to 1 year.

24
Literature Databases

Literature databases
PubMed
PubMed Central
Books
OMIM

25
NCBI BookShelf

The BookShelf is a collection of biomedical books
that can be searched directly in Entrez or found
via keyword links in PubMed abstracts.
Books have been added to the BookShelf in
collaboration with authors and publishers, and
the complete content (including all figures and
tables) is free to use for anyone with an
Internet connection.
The online books are displayed one section at a
time, with navigation provided to other parts of
the current chapter or to other chapters within
the book.
Many of the books on the BookShelf can be browsed
without any restriction at all; others have less
flexibility for navigating the complete content.
The publisher (or the owner of the content)
defines the rules for access.
The books are linked to PubMed through research
paper citations within the text.

26
NCBI BookShelf
27
NCBI BookShelf
28
Literature Databases

Literature databases
PubMed
PubMed Central
Books
OMIM

29
OMIM

Online Mendelian Inheritance in Man (OMIM™) is
a timely, authoritative compendium of
bibliographic material and observations on
inherited disorders and human genes. It is the
continuously updated electronic version of
Mendelian Inheritance in Man (MIM).
MIM was last published in 1998 and is authored
and edited by Dr. Victor A. McKusick and a team
of science writers, editors, scientists, and
physicians at The Johns Hopkins University and
around the world. Curation of the database and
editorial decisions take place at The Johns
Hopkins University School of Medicine.

30
OMIM
31
OMIM
32
OMIM
33
OMIM
34
Literature Databases

Literature databases
PubMed
PubMed Central
Books
OMIM

35
Databases at NCBI

Databases at NCBI
Literature databases
PubMed, PubMed Central, Books, OMIM
Molecular databases
Sequences
EST, STS, GSS, HTGS, HTC, FLIC, UniGene, RefSeq,
HomoloGene
Structures
MMDB, CDD
Taxonomy
Other databases
GEO, SKY/CGH

36
Molecular Databases

Sequences databases
HTGS, HTC & FLIC
EST
STS
GSS
UniGene
RefSeq
HomoloGene
Structures databases
MMDB
CDD
Taxonomy

37
HTGS

High-throughput genomic sequence (HTGS) entries
are submitted in bulk by genome centers,
processed by an automated system, and then
released to GenBank.
To submit sequences in bulk to the HTG processing
system, a center or group must set up an FTP
account. Submitters frequently use two tools to
create HTG submissions, Sequin or fa2htgs.

38
HTGS

Phase 0 sequences are one-to-few reads of a
single clone and are not usually assembled into
contigs. They are low-quality sequences that are
often used to check whether another center is
already sequencing a particular clone.
Phase 1 entries are assembled into contigs that
are separated by sequence gaps, the relative
order and orientation of which are not known.
Phase 2 entries are also unfinished sequences
that may or may not contain sequence gaps. If
there are gaps, then the contigs are in the
correct order and orientation.
Phase 3 sequences are of finished quality and
have no gaps.
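
The four phases can be summarized as a small decision function. This is only a sketch of the descriptions above; the function name and its boolean parameters are hypothetical and do not correspond to any field in an actual HTG submission.

```python
def htgs_phase(assembled: bool, has_gaps: bool, ordered: bool, finished: bool) -> int:
    """Return the HTGS phase (0-3) implied by the descriptions above."""
    if finished and not has_gaps:
        return 3  # finished quality, no gaps
    if not assembled:
        return 0  # one-to-few reads, not assembled into contigs
    # Unfinished but assembled: phase 2 has no gaps, or has gaps with
    # contigs in the correct order and orientation.
    if not has_gaps or ordered:
        return 2
    return 1  # contigs separated by gaps, order and orientation unknown

print(htgs_phase(assembled=True, has_gaps=True, ordered=False, finished=False))
```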

NCBI Handbook
39
Genome Sequencing

Bacterial artificial chromosome (BAC) Sequencing

http://www.genomenewsnetwork.org/articles/06_00/sequence_primer.shtml
40
Genome Sequencing
Nature, Vol. 381, 364-366 (1996)
http://en.wikipedia.org/
41
Genome Sequencing

Whole Genome Shotgun (WGS) Sequencing

42
Genome Sequencing
Nature, Vol. 381, 364-366 (1996)
43
Genome Sequencing

BAC sequencing
High precision
Slow
Shotgun sequencing
High throughput
Consumes large computational resources
Fast in the early stages, but complicated in the
later stages

44
HTGS

Submission tools
fa2htgs: command-line program
tbl2asn: command-line program
Sequin: stand-alone bulk submission tool

http://www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm
45
HTC & FLIC

HTC records are High-Throughput cDNA/mRNA
submissions that are similar to ESTs but often
contain more information.
FLIC records, Full-Length Insert cDNA, contain
the entire sequence of a cloned cDNA/mRNA.
Therefore, FLICs are generally longer, and
sometimes even full-length, mRNAs. They are
usually annotated with genes and coding regions,
although these may be lab systematic names rather
than functional names.

46
Molecular Databases

Sequences databases
HTGS, HTC & FLIC
EST
STS
GSS
UniGene
RefSeq
HomoloGene
Structures databases
MMDB
CDD
Taxonomy

47
What are Expressed Sequence Tags

ESTs are small pieces of DNA sequence (usually
200 to 500 nucleotides long) that are generated
by sequencing either one or both ends of an
expressed gene. The idea is to sequence bits of
DNA that represent genes expressed in certain
cells, tissues, or organs from different
organisms and use these "tags" to fish a gene out
of a portion of chromosomal DNA by matching base
pairs. The challenge associated with identifying
genes from genomic sequences varies among
organisms and is dependent upon genome size as
well as the presence or absence of introns, the
intervening DNA sequences interrupting the
protein coding sequence of a gene.
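
A toy sketch of how end reads relate to a cloned cDNA is given below. It assumes, as a simplification, that the 3' read comes from the opposite strand and is therefore the reverse complement of the insert's end. The sequence and read length are made up; real ESTs are far longer.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]

def end_reads(cdna: str, read_len: int) -> tuple[str, str]:
    """Return (5' EST, 3' EST) for a cDNA insert.

    The 5' read is the start of the insert. The 3' read is modeled as the
    reverse complement of the insert's end (an assumption of this sketch).
    """
    five_prime = cdna[:read_len]
    three_prime = reverse_complement(cdna[-read_len:])
    return five_prime, three_prime

# Hypothetical 24-nucleotide insert, read 8 nucleotides from each end.
cdna = "ATGGCCAAGTTTGACCTGAAATAA"
five, three = end_reads(cdna, 8)
print(five, three)
```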

http://www.ncbi.nlm.nih.gov/About/primer/est.html
48
What are Expressed Sequence Tags
http://www.ncbi.nlm.nih.gov/About/primer/est.html
49
What are Expressed Sequence Tags
[Figure: a cDNA sequenced from both ends, yielding a 5' EST and a 3' EST]

Usually 200-500 nucleotides long

50
What are Expressed Sequence Tags
[Figure: the 5' EST and 3' EST mapped back to the chromosome sequence]
51
Expressed Sequence Tags (ESTs)
52
Molecular Databases

Sequences databases
HTGS, HTC & FLIC
EST
STS
GSS
UniGene
RefSeq
HomoloGene
Structures databases
MMDB
CDD
Taxonomy

53
Sequence clustering

Because a gene can be expressed as mRNA many,
many times, ESTs ultimately derived from this
mRNA may be redundant. That is, there may be many
identical, or similar, copies of the same EST.
Such redundancy and overlap means that when
someone searches dbEST for a particular EST, they
may retrieve a long list of tags, many of which
may represent the same gene. Searching through
all of these identical ESTs can be very time
consuming.
To resolve the redundancy and overlap problem,
NCBI investigators developed the UniGene
database.
UniGene automatically partitions GenBank
sequences into a non-redundant set of
gene-oriented clusters.

http://www.ncbi.nlm.nih.gov/About/primer/est.html
54
Sequence clustering
[Figure: chromosome, pre-mRNA, mRNA, and cDNA library clones No. 1 to No. 6]
55
Sequence clustering
56
Sequence clustering
57
Sequence clustering
[Figure: sequences grouped into UniGene clusters UG No. 1 to UG No. 4]
58
Introduction of UniGene database

UniGene Build Procedure - Transcriptome Based
Clustering is the process of finding
subsets of sequences that belong together within
a larger set. This is done by converting discrete
similarity scores to Boolean links between
sequences. That is, two sequences are considered
linked if their similarity exceeds a threshold.
UniGene clustering proceeds in several stages,
with each stage adding less reliable data to the
results of the preceding stage. This staged
clustering affords greater control than a more
egalitarian treatment of all links between
sequences.
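
The idea of turning similarity scores into Boolean links and then collecting linked sequences can be sketched as single-linkage clustering with a union-find structure. This shows one stage only, not the staged UniGene build. The shared 4-mer similarity is a stand-in for a real alignment score, and the sequences and threshold are invented.

```python
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Toy similarity: fraction of shared 4-mers (not a real alignment score)."""
    ka = {a[i:i + 4] for i in range(len(a) - 3)}
    kb = {b[i:i + 4] for i in range(len(b) - 3)}
    return len(ka & kb) / max(1, min(len(ka), len(kb)))

def cluster(seqs: dict[str, str], threshold: float) -> list[set[str]]:
    parent = {name: name for name in seqs}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for a, b in combinations(seqs, 2):
        # Boolean link: two sequences are linked if similarity exceeds the threshold.
        if similarity(seqs[a], seqs[b]) > threshold:
            parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=sorted)

# Hypothetical overlapping ESTs plus one unrelated sequence.
ests = {
    "est1": "ATGGCCAAGTTTGACC",
    "est2": "CAAGTTTGACCTGAAA",
    "est3": "TTTGACCTGAAATAAG",
    "est4": "GGGCCCGGGCCCGGGC",
}
clusters = cluster(ests, threshold=0.5)
print(clusters)
```

Note that est1 and est3 are not directly linked here; they end up in the same cluster because both link to est2.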

http://www.ira.cinvestav.mx:8080/GenBioMolI_05/DOCUMENTOS/HTML/NCBI/UniGene%20Build%20Procedures.htm
59
UniGene database
60
Sequence clustering
61
Sequence clustering
62
UniGene database
63
UniGene database
64
UniGene database
65
Brief Introduction to the Cancer Genome Anatomy Project
66
Brief Introduction to the Cancer Genome Anatomy Project

The goal of CGAP is to determine the gene
expression profiles of normal, precancer, and
cancer cells.

67
Brief Introduction to the Cancer Genome Anatomy Project
68
Digital Differential Display
[Figure: EST counts (EST No.) for Genes A to D in Tissues A to D, drawing on UniGene, dbEST, and CGAP]
69
Digital Differential Display
[Figure: EST counts (EST No.) for Genes A to D in Tissues A and B, drawing on UniGene, dbEST, and CGAP]
70
Digital Differential Display

DDD is a tool for comparing EST-based expression
profiles among the various libraries, or pools of
libraries, represented in UniGene. These
comparisons allow the identification of those
genes that differ among libraries of different
tissues, making it possible to determine which
genes may be contributing to a cell's unique
characteristics, e.g., those that make a muscle
cell different from a skin or liver cell.
Along similar lines, DDD can be used to try to
identify genes for which the expression levels
differ between normal, premalignant, and
cancerous tissues or different stages of
embryonic development.
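
A comparison of this kind can be sketched as a test on EST counts from two library pools. Fisher's exact test is used below as one common choice for small counts; that it matches DDD's exact procedure is an assumption here, and all counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables at least as extreme as the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= observed * (1 + 1e-9))

# Hypothetical counts: ESTs assigned to one gene vs. all ESTs in each pool.
gene_a, total_a = 12, 1000   # pool A, e.g. one tissue
gene_b, total_b = 2, 1000    # pool B, e.g. another tissue
p = fisher_exact_two_sided(gene_a, total_a - gene_a, gene_b, total_b - gene_b)
print(f"fraction A = {gene_a / total_a:.4f}, fraction B = {gene_b / total_b:.4f}, p = {p:.4f}")
```

The fractions play the role of a digital expression level, and the p-value indicates whether the difference between pools is larger than sampling noise would explain.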

71
Digital Differential Display
72
Digital Differential Display
73
Digital Differential Display
74
Digital Differential Display
75
Digital Differential Display
76
Digital Differential Display
77
Digital Differential Display
78
Digital Differential Display
79
Digital Differential Display
80
Molecular Databases

Sequences databases
HTGS, HTC & FLIC
EST
STS
GSS
UniGene
RefSeq
HomoloGene
Structures databases
MMDB
CDD
Taxonomy

81
STS

The National Research Council (NRC) committee's
discussions identified two problems in generating
a genome map by PCR:
The difficulty of merging mapping data gathered
by diverse methods in different laboratories into
a consensus physical map.
The logistics and expense of managing the huge
collections of cloned segments on which the
mapping data would depend almost absolutely.

82
STS

Sequence tagged sites (STSs) are short genomic
landmark sequences. They are operationally unique
in that they are specifically amplified from the
genome by PCR amplification. In addition, they
define a specific location on the genome and are,
therefore, useful for mapping.
In most instances, 200 to 500 bp of sequence
define an STS that is operationally unique in the
human genome.
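
"Operationally unique" can be illustrated with a toy in-silico PCR: search a sequence for a primer pair and check that exactly one product of acceptable size results. This is a sketch with exact string matching only. The function name, sequence, and primers are hypothetical, and real tools allow mismatches and search both strands.

```python
def reverse_complement(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def find_sts(genome: str, fwd: str, rev: str, max_product: int = 500) -> list[tuple[int, int]]:
    """Return (start, product_length) for each site the primer pair could amplify."""
    # Site where the reverse primer binds, written on the forward strand.
    rev_site = reverse_complement(rev)
    hits = []
    start = genome.find(fwd)
    while start != -1:
        end = genome.find(rev_site, start + len(fwd))
        if end != -1:
            length = end + len(rev_site) - start
            if length <= max_product:
                hits.append((start, length))
        start = genome.find(fwd, start + 1)
    return hits

# Hypothetical genomic fragment and primer pair.
genome = "TTGACCATGGCCAAGTTTGACCTGAAATAAGGC"
hits = find_sts(genome, fwd="ATGGCC", rev="TTATTTC")
print(hits, "operationally unique:", len(hits) == 1)
```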

83
STS
84
STS
85
Molecular Databases

Sequences databases
HTGS, HTC & FLIC
EST
STS
GSS
UniGene
RefSeq
HomoloGene
Structures databases
MMDB
CDD
Taxonomy

86
GSS

The genome survey sequences (GSS) division of
GenBank is similar to the EST division, with the
exception that most of the sequences are genomic
in origin, rather than cDNA (mRNA). It should be
noted that two classes (exon trapped products and
gene trapped products) may be derived via a cDNA
intermediate. Care should be taken when analyzing
sequences from either of these classes, as a
splicing event could have occurred and the
sequence represented in the record may be
interrupted when compared to genomic sequence.
The GSS division contains (but is not limited to)
the following types of data:
random "single pass read" genome survey sequences
cosmid/BAC/YAC end sequences
exon trapped genomic sequences
Alu PCR sequences
transposon-tagged sequences

87
GSS

Many labs have approached GenBank over the last
few months, interested in submitting these types
of sequences. We have been reluctant to introduce
them via the existing GenBank divisions. On the
other hand, such sequences are of value to the
genome community, and require similar processing
and access tools as have been provided for ESTs
and STSs. GSS sequences will be used,
amongst other things, as a framework for the
mapping and sequencing of genome size pieces
which will be present in the standard GenBank
divisions.
Sequence data appropriate for the new GSS
division are, to date, generated by genome labs
performing human genome sequencing; we expect
that similar data will be generated for other
model organisms, such as the mouse.

88
Molecular Databases

Sequences databases
HTGS, HTC & FLIC
EST
STS
GSS
UniGene
RefSeq
HomoloGene
Structures databases
MMDB
CDD
Taxonomy

89
RefSeq

RefSeq biological sequences (also known as
RefSeqs) are derived from GenBank records but
differ in that each RefSeq is a synthesis of
information, not an archived unit of primary
research data.
RefSeq provides a non-redundant framework of
information to facilitate database searches,
whether they are searched via genomic location,
sequence, or text annotation.

90
RefSeq

The RefSeq database is the result of data
extraction from GenBank, curation, and
computation, combined with extensive
collaboration with authoritative groups. Each
molecule is annotated as accurately as possible
with the organism name, strain (or breed,
ecotype, cultivar, or isolate), gene symbol for
that organism, and informative protein name.
In cases when a molecule is represented by
multiple sequences for an organism in GenBank, an
effort is made by NCBI staff to select the "best"
sequence to be presented as a RefSeq. The goal is
to avoid known mutations, sequencing errors,
cloning artifacts, and erroneous annotation.

91
RefSeq
92
RefSeq
93
RefSeq
94
RefSeq
95
RefSeq
96
RefSeq
97
RefSeq
98
Molecular Databases

Sequences databases
HTGS, HTC & FLIC
EST
STS
GSS
UniGene
RefSeq
HomoloGene
Structures databases
MMDB
CDD
Taxonomy

99
HomoloGene
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/Orthology.html
100
HomoloGene
101
HomoloGene

HomoloGene Build Procedure
The input for HomoloGene processing consists of
the proteins from the input organisms. These
sequences are compared to one another (using
blastp) and then are matched up and put into
groups, using a tree built from sequence
similarity to guide the process, where more closely
related organisms are matched up first, and then
further organisms are added as the tree is
traversed toward the root. The protein alignments
are mapped back to their corresponding DNA
sequences, where distance metrics can be
calculated (e.g. molecular distance, Ka/Ks
ratio). Sequences are matched using synteny when
applicable. Remaining sequences are matched up by
using an algorithm for maximizing the score
globally, rather than locally, in a bipartite
matching. Cutoffs on bits per position and Ks
values are set to prevent unlikely "orthologs"
from being grouped together. These cutoffs are
calculated based on the respective score
distribution for the given groups of organisms.
Paralogs are identified by finding sequences that
are closer within a species than to other species.
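
The difference between maximizing the score locally and globally in a bipartite matching can be seen in a small example. This is an illustration of the idea only, not the HomoloGene implementation. The score matrix is invented, and the brute-force search is practical only for tiny inputs.

```python
from itertools import permutations

def greedy_matching(scores: list[list[float]]) -> dict[int, int]:
    """Local strategy: repeatedly take the best remaining pair."""
    pairs = sorted(
        ((s, i, j) for i, row in enumerate(scores) for j, s in enumerate(row)),
        reverse=True,
    )
    used_i, used_j, match = set(), set(), {}
    for s, i, j in pairs:
        if i not in used_i and j not in used_j:
            match[i] = j
            used_i.add(i)
            used_j.add(j)
    return match

def global_matching(scores: list[list[float]]) -> dict[int, int]:
    """Global strategy: the assignment with the highest total score (brute force)."""
    n = len(scores)  # assumes a square matrix
    best = max(
        permutations(range(n)),
        key=lambda p: sum(scores[i][p[i]] for i in range(n)),
    )
    return dict(enumerate(best))

# Hypothetical similarity scores: rows are proteins of species X,
# columns are proteins of species Y.
scores = [
    [10, 9],
    [9, 1],
]
local = greedy_matching(scores)
best = global_matching(scores)
print("local:", local, "global:", best)
```

The greedy choice locks in the single best pair (score 10) and is left with a poor second pair, while the global solution accepts two scores of 9 for a higher total.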

102
HomoloGene
103
HomoloGene
104
Molecular Databases

Sequences databases
HTGS, HTC & FLIC
EST
STS
GSS
UniGene
RefSeq
HomoloGene
Structures databases
MMDB
CDD
Taxonomy

105
MMDB

The Molecular Modeling Database (MMDB) is based on
the structures within Protein Data Bank (PDB) and
can be queried using the Entrez search engine, as
well as via the more direct but less flexible
structure summary search. Once found, any
structure of interest can be viewed using Cn3D, a
piece of software that can be freely downloaded
for Mac, PC, and UNIX platforms.

106
MMDB
107
MMDB
108
MMDB
109
MMDB
110
MMDB

VAST Search is a WWW service which allows you to
compare the 3-dimensional structure of an input
protein with other protein structures in NCBI's
MMDB, using the VAST algorithm.
VAST Search is NCBI's structure-structure
similarity search service. It compares 3D
coordinates of a newly determined protein
structure to those in the MMDB/PDB database. VAST
Search computes a list of structure neighbors
that you may browse interactively, viewing
superpositions and alignments by molecular
graphics.
The output of the pre-computed VAST searches is a
list of structure records, each representing one
of the non-redundant PDB chain sets (nr-PDB),
which can also be downloaded. There are four
clustered subsets of MMDB that compose nr-PDB,
each consisting of clusters having a preset level
of sequence similarity.

111
MMDB
112
MMDB
113
Molecular Databases

Sequences databases
HTGS, HTC & FLIC
EST
STS
GSS
UniGene
RefSeq
HomoloGene
Structures databases
MMDB
CDD
Taxonomy

114
CDD

The collections of domain alignments in the
Conserved Domain Database (CDD) are imported from
two databases outside of NCBI, Pfam and the Simple
Modular Architecture Research Tool (SMART); from
the NCBI COG database; from another NCBI
collection named Library of Ancient Domains
(LOAD); and from a database curated by the CDD
staff.

115
CDD
116
CDD
117
CDD
118
CDD
119
CDD

Given a query sequence, CDART shows the
functional domains that make up a protein and
then lists proteins with a similar domain
architecture. The functional domains for a
sequence are found by RPS-BLAST, which defines a
domain by a PSSM (position-specific scoring
matrix), a set of probabilities of each amino acid
occurring at each position of the domain.
RPS-BLAST is known as a "profile" search, which
is a sensitive way to look for sequence
homologues.
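
How a PSSM scores a sequence can be sketched with a toy matrix built from a few aligned segments. RPS-BLAST itself is far more elaborate. The four-letter alphabet, the aligned segments, the uniform background, and the pseudocount below are all assumptions made for illustration.

```python
import math

def build_pssm(aligned: list[str], alphabet: str, pseudocount: float = 1.0) -> list[dict[str, float]]:
    """Per-position log-odds scores against a uniform background."""
    background = 1.0 / len(alphabet)
    pssm = []
    for column in zip(*aligned):
        total = len(column) + pseudocount * len(alphabet)
        pssm.append({
            aa: math.log2(((column.count(aa) + pseudocount) / total) / background)
            for aa in alphabet
        })
    return pssm

def score(pssm: list[dict[str, float]], segment: str) -> float:
    return sum(col[aa] for col, aa in zip(pssm, segment))

def best_hit(pssm: list[dict[str, float]], sequence: str) -> tuple[int, float]:
    """Slide the matrix along the sequence and return (position, score) of the best window."""
    width = len(pssm)
    return max(
        ((i, score(pssm, sequence[i:i + width])) for i in range(len(sequence) - width + 1)),
        key=lambda hit: hit[1],
    )

# Hypothetical aligned domain segments over a reduced alphabet.
pssm = build_pssm(["GKS", "GKT", "GKS"], alphabet="GKST")
position, value = best_hit(pssm, "TTGKSTT")
print(position, round(value, 2))
```

Because every position has its own score column, a residue that is tolerated at one position of the domain can be penalized at another, which is what makes a profile search more sensitive than a single-sequence comparison.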

120
CDD
121
CDD
122
Molecular Databases

Sequences databases
HTGS, HTC & FLIC
EST
STS
GSS
UniGene
RefSeq
HomoloGene
Structures databases
MMDB
CDD
Taxonomy

123
Taxonomy

The NCBI Taxonomy database is a curated set of
names and classifications for all of the
organisms that are represented in GenBank. When
new sequences are submitted to GenBank, the
submission is checked for new organism names,
which are then classified and added to the
Taxonomy database.
Of the several different ways to build a
taxonomy, our group maintains a phylogenetic
taxonomy. In a phylogenetic classification
scheme, the structure of the taxonomic tree
approximates the evolutionary relationships among
the organisms included in the classification.
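
A taxonomic tree of this kind can be represented with child-to-parent links, from which lineages and common ancestors follow directly. The slice below is hand-written and heavily simplified (intermediate ranks are omitted); it is not an extract of the NCBI Taxonomy database.

```python
# Child -> parent links for a tiny, simplified slice of a taxonomy tree.
PARENT = {
    "Homo sapiens": "Homo",
    "Homo": "Hominidae",
    "Pan troglodytes": "Pan",
    "Pan": "Hominidae",
    "Hominidae": "Primates",
    "Mus musculus": "Mus",
    "Mus": "Rodentia",
    "Primates": "Mammalia",
    "Rodentia": "Mammalia",
    "Mammalia": "root",
}

def lineage(taxon: str) -> list[str]:
    """Walk parent links from a taxon up to the root."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def common_ancestor(a: str, b: str) -> str:
    """First node on b's lineage that also lies on a's lineage."""
    ancestors = set(lineage(a))
    return next(t for t in lineage(b) if t in ancestors)

print(lineage("Homo sapiens"))
print(common_ancestor("Homo sapiens", "Mus musculus"))
```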

124
Taxonomy
125
Taxonomy
126
Taxonomy
127
Molecular Databases

Sequences databases
HTGS, HTC & FLIC
EST
STS
GSS
UniGene
RefSeq
HomoloGene
Structures databases
MMDB
CDD
Taxonomy

128
Databases at NCBI

Databases at NCBI
Literature databases
PubMed, PubMed Central, Books, OMIM
Molecular databases
Sequences
EST, STS, GSS, HTGS, HTC, FLIC, UniGene, RefSeq,
HomoloGene
Structures
MMDB, CDD
Taxonomy
Other databases
GEO, SKY/CGH

129
Other Databases

Other databases
GEO
SKY/CGH

130
GEO

The Gene Expression Omnibus (GEO) project was
initiated at NCBI in 1999 in response to the
growing demand for a public repository for data
generated from high-throughput microarray
experiments. GEO has a flexible and open design
that allows the submission, storage, and
retrieval of many types of data sets, such as
those from high-throughput gene expression,
genomic hybridization, and antibody array
experiments.

131
GEO
132
GEO
133
GEO
134
GEO
135
GEO
136
GEO
137
Other Databases

Other databases
GEO
SKY/CGH

138
SKY/CGH

Spectral Karyotyping (SKY) and Comparative
Genomic Hybridization (CGH) are complementary
fluorescent molecular cytogenetic techniques that
have revolutionized the detection of chromosomal
abnormalities.
SKY permits the simultaneous visualization of all
human or mouse chromosomes, each in a different
color, facilitating the detection of chromosomal
translocations and rearrangements.
CGH uses the hybridization of differentially
labeled tumor and reference DNA to generate a map
of DNA copy number changes in tumor genomes.
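
Copy number changes in CGH are commonly read from the ratio of tumor to reference signal. The sketch below assumes a log2 ratio with a symmetric cutoff; the cutoff value, region names, and intensities are hypothetical and would be set per experiment in practice.

```python
import math

def call_copy_number(tumor: float, reference: float, threshold: float = 0.3) -> str:
    """Classify a region from tumor and reference intensities (hypothetical cutoff)."""
    ratio = math.log2(tumor / reference)
    if ratio > threshold:
        return "gain"
    if ratio < -threshold:
        return "loss"
    return "normal"

# Hypothetical (tumor, reference) fluorescence intensities per region.
regions = {"region1": (1.5, 1.0), "region2": (0.5, 1.0), "region3": (1.0, 1.0)}
calls = {name: call_copy_number(t, r) for name, (t, r) in regions.items()}
print(calls)
```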

139
SKY/CGH
140
SKY/CGH
141
SKY/CGH
142
SKY/CGH
143
SKY/CGH
144
SKY/CGH
145
SKY/CGH
146
SKY/CGH
147
Other Databases

Other databases
GEO
SKY/CGH

148
Databases at NCBI

Databases at NCBI
Literature databases
PubMed, PubMed Central, Books, OMIM
Molecular databases
Sequences
EST, STS, GSS, HTGS, HTC, FLIC, UniGene, RefSeq,
HomoloGene
Structures
MMDB, CDD
Taxonomy
Other databases
GEO, SKY/CGH

149
Entrez

Entrez is the text-based search and retrieval
system used at NCBI for all of the major
databases, including PubMed, Nucleotide and
Protein Sequences, Protein Structures, Complete
Genomes, Taxonomy, OMIM, and many others. Entrez
is at once an indexing and retrieval system, a
collection of data from many sources, and an
organizing principle for biomedical information.
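
Entrez can also be queried programmatically through NCBI's E-utilities, which the slides do not cover. The sketch below only builds an ESearch request URL and makes no network call; the helper name, the search term, and the `retmax` value are illustrative choices.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities ESearch service.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a text query against one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"

url = esearch_url("pubmed", "expressed sequence tags[Title]", retmax=5)
print(url)
```

Fetching this URL returns a list of matching record identifiers, which is the same text search that the Entrez web interface performs.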

150
Entrez
151
Entrez
152
Entrez
153
Databases at NCBI
