Interface of biology and computers - PowerPoint PPT Presentation

Provided by: jonathan417

Transcript and Presenter's Notes

1
What is bioinformatics?

Interface of biology and computers
Analysis of proteins, genes and genomes
using computer algorithms and
computer databases
Genomics is the analysis of genomes.
The tools of bioinformatics are used to make
sense of the billions of base pairs of DNA
that are sequenced by genomics projects.

2
Top ten challenges for bioinformatics
1. Precise models of where and when transcription will occur in a genome (initiation and termination)
2. Precise, predictive models of alternative RNA splicing
3. Precise models of signal transduction pathways: ability to predict cellular responses to external stimuli
4. Determining protein-DNA, protein-RNA, and protein-protein recognition codes
5. Accurate ab initio protein structure prediction
3
Top ten challenges for bioinformatics
6. Rational design of small molecule inhibitors of proteins
7. Mechanistic understanding of protein evolution
8. Mechanistic understanding of speciation
9. Development of effective gene ontologies: systematic ways to describe gene and protein function
10. Education: development of bioinformatics curricula
Source: Ewan Birney, Chris Burge, Jim Fickett
4
After Pace NR (1997) Science 276:734
5
DNA -> RNA -> protein -> phenotype
6
Growth of GenBank
[Figure (Fig. 2.1, page 17): base pairs of DNA (billions) and sequences (millions) plotted by year, 1982 to 2002. Updated 8-12-04: >40 billion base pairs.]
7
Central dogma of molecular biology
DNA -> RNA -> protein
8
DNA -> RNA -> protein -> phenotype
DNA: genomic DNA databases
RNA: cDNA, ESTs, UniGene
protein: protein sequence databases
9
There are three major public DNA databases
GenBank: housed at NCBI (National Center for Biotechnology Information)
EMBL: housed at EBI (European Bioinformatics Institute)
DDBJ: housed in Japan
The underlying raw DNA sequences are identical
10
Caveats

Remember what you are looking for
Seems obvious, but sometimes it isn't
Formats!
Due to the hodge-podge history of database
development and sequence data acquisition, MANY
different formats exist
Don't feed the wrong format to a search engine or
you won't get a response
Stay focused and alert.
Many times you'll get a hit that was not exactly
what you were looking for. It may lead you
someplace you weren't expecting to be, but you
may be glad to be there!

11
National Center for Biotechnology Information
(NCBI) www.ncbi.nlm.nih.gov
12
www.ncbi.nlm.nih.gov
14

PubMed is
National Library of Medicine's search service
16 million citations in MEDLINE
links to participating online journals
PubMed tutorial (via Education on side bar)

Entrez integrates
the scientific literature
DNA and protein sequence databases
3D protein structure data
population study data sets
assemblies of complete genomes

16
Entrez is a search and retrieval system that
integrates NCBI databases
17

BLAST is
Basic Local Alignment Search Tool
NCBI's sequence similarity search tool
supports analysis of DNA and protein databases
100,000 searches per day

Structure site includes
Molecular Modelling Database (MMDB)
biopolymer structures obtained from
the Protein Data Bank (PDB)
Cn3D (a 3D-structure viewer)
vector alignment search tool (VAST)

19
Accessing information on molecular sequences
20
Accession numbers are labels for sequences
NCBI includes databases (such as GenBank) that
contain information on DNA, RNA, or protein
sequences. You may want to acquire information
beginning with a query such as the name of a
protein of interest, or the raw nucleotides
comprising a DNA sequence of interest. DNA
sequences and other molecular data are tagged
with accession numbers that are used to identify
a sequence or other record relevant to molecular
data.
21
What is an accession number?
An accession number is a label that is used to
identify a sequence. It is a string of letters
and/or numbers that corresponds to a molecular
sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
X02775       GenBank genomic DNA sequence
NT_030059    Genomic contig
rs7079946    dbSNP (single nucleotide polymorphism)
RNA
N91759.1     An expressed sequence tag (1 of 170)
NM_006744    RefSeq DNA sequence (from a transcript)
protein
NP_006735    RefSeq protein
AAC02945     GenBank protein
Q28369       SwissProt protein
1KT7         Protein Data Bank structure record
22
Four ways to access DNA and protein sequences
1. Entrez Gene with RefSeq
2. UniGene
3. European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4. ExPASy Sequence Retrieval System (separate from NCBI)
23
Four ways to access protein and DNA sequences
1. Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects
key information on each gene/protein from major
databases. It covers all major organisms.
RefSeq provides a curated, optimal accession
number for each DNA (NM_006744) or protein
(NP_006735).
24
From the NCBI home page, type rbp4 and hit Go
28
By applying limits, there are now just two entries
29
Entrez Gene (top of page)
Note that links to many other RBP4 database
entries are available
30
Entrez Gene (middle of page)
31
Entrez Gene (bottom of page)
35
FASTA format
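A FASTA record is plain text: a header line beginning with ">" followed by one or more lines of sequence. As a minimal sketch, the Python below parses such text into (header, sequence) pairs. The function name `parse_fasta` and the example record are illustrative placeholders; the residues shown are made up and are not a real database entry.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record for illustration only (hypothetical identifier and residues).
example = """>example_id hypothetical protein
MKWVWALLLL
AAWAAAERDC
"""
records = parse_fasta(example)
```

Joining the sequence lines is the main design point: FASTA files usually wrap sequences at a fixed width, so a record's sequence has to be reassembled before use.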
37
NCBI's important RefSeq project: best
representative sequences
RefSeq (accessible via the main page of
NCBI) provides an expertly curated accession
number that corresponds to the most stable,
agreed-upon reference version of a sequence.
RefSeq identifiers include the following formats:
Complete genome       NC_
Complete chromosome   NC_
Genomic contig        NT_
mRNA (DNA format)     NM_   e.g. NM_006744
Protein               NP_   e.g. NP_006735
38
NCBI's RefSeq project: accession for genomic,
mRNA, protein sequences
Accession         Molecule  Method         Note
AC_123456         Genomic   Mixed          Alternate complete genomic
AP_123456         Protein   Mixed          Protein products; alternate
NC_123456         Genomic   Mixed          Complete genomic molecules
NG_123456         Genomic   Mixed          Incomplete genomic regions
NM_123456         mRNA      Mixed          Transcript products; mRNA
NM_123456789      mRNA      Mixed          Transcript products; 9-digit
NP_123456         Protein   Mixed          Protein products
NP_123456789      Protein   Curation       Protein products; 9-digit
NR_123456         RNA       Mixed          Non-coding transcripts
NT_123456         Genomic   Automated      Genomic assemblies
NW_123456         Genomic   Automated      Genomic assemblies
NZ_ABCD12345678   Genomic   Automated      Whole genome shotgun data
XM_123456         mRNA      Automated      Transcript products
XP_123456         Protein   Automated      Protein products
XR_123456         RNA       Automated      Transcript products
YP_123456         Protein   Auto. Curated  Protein products
ZP_12345678       Protein   Automated      Protein products
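Because the two-letter prefix of a RefSeq accession indicates the molecule type, the table can be turned into a simple lookup. The sketch below is one way to do that in Python; the dictionary is transcribed from the slide's table, while the names `REFSEQ_MOLECULE` and `refseq_molecule` are hypothetical and chosen only for this illustration.

```python
# Prefix -> molecule type, transcribed from the RefSeq table on the slide.
REFSEQ_MOLECULE = {
    "AC": "Genomic", "NC": "Genomic", "NG": "Genomic",
    "NT": "Genomic", "NW": "Genomic", "NZ": "Genomic",
    "NM": "mRNA", "XM": "mRNA",
    "NR": "RNA", "XR": "RNA",
    "AP": "Protein", "NP": "Protein", "XP": "Protein",
    "YP": "Protein", "ZP": "Protein",
}


def refseq_molecule(accession):
    """Return the molecule type for a RefSeq-style accession, or None.

    Accessions without the 'XX_' prefix pattern (for example GenBank or
    Protein Data Bank identifiers) are not RefSeq-style and give None.
    """
    prefix, separator, rest = accession.partition("_")
    if not separator or not rest:
        return None
    return REFSEQ_MOLECULE.get(prefix.upper())


print(refseq_molecule("NM_006744"))  # RBP4 transcript from the slides
print(refseq_molecule("X02775"))     # GenBank accession, not RefSeq-style
```

The lookup deliberately ignores the Method column, since one prefix (NP_) appears in the table with two different methods.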
40
[Diagram: DNA -> RNA -> protein, with complementary DNA (cDNA) derived from RNA and collected in UniGene]
41
UniGene: unique genes via ESTs

Find UniGene at NCBI:
www.ncbi.nlm.nih.gov/UniGene
UniGene clusters contain many expressed sequence
tags (ESTs), which are DNA sequences (typically
500 base pairs in length) corresponding to the
mRNA from an expressed gene. ESTs are sequenced
from a complementary DNA (cDNA) library.
UniGene data come from many cDNA libraries.
Thus, when you look up a gene in UniGene,
you get information on its abundance
and its regional distribution.
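To make the idea of abundance and regional distribution concrete, the sketch below counts the ESTs in one cluster and tallies them by the tissue of the source cDNA library. The EST identifiers, tissues, and the function name `summarize_cluster` are all hypothetical; this is an illustration of the counting idea, not UniGene's actual method.

```python
from collections import Counter

# Hypothetical ESTs in one cluster: (EST identifier, tissue of source library).
ests = [
    ("est1", "brain"),
    ("est2", "liver"),
    ("est3", "liver"),
    ("est4", "liver"),
]


def summarize_cluster(ests):
    """Return (cluster size, EST count per tissue) for one cluster."""
    by_tissue = Counter(tissue for _, tissue in ests)
    return len(ests), dict(by_tissue)


size, distribution = summarize_cluster(ests)
print(size)          # number of ESTs in the cluster
print(distribution)  # ESTs per tissue
```

Raw counts like these depend on how deeply each library was sequenced, so they are best read as a rough indication of where a gene is expressed.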

42
Cluster sizes in UniGene
This is a gene with 1 EST associated; the cluster
size is 1
43
Cluster sizes in UniGene
This is a gene with 10 ESTs associated; the
cluster size is 10
44
Cluster sizes in UniGene (human)
Cluster size (ESTs)   Number of clusters
1                     42,800
2                     6,500
3-4                   6,500
5-8                   5,400
9-16                  4,100
17-32                 3,300
500-1000              2,128
2000-4000             233
8000-16,000           21
16,000-30,000         8
UniGene build 194, 8/06
45
UniGene: unique genes via ESTs
Conclusion: UniGene is a useful tool to look
up information about expressed genes.
UniGene displays information about the abundance
of a transcript (expressed gene), as well as its
regional distribution of expression (e.g. brain
vs. liver).
47
Ensembl to access protein and DNA sequences
Try Ensembl at www.ensembl.org for a
premier human genome web browser.
48
click human
49
enter RBP4
52
ExPASy to access protein and DNA sequences
ExPASy sequence retrieval system (ExPASy: Expert
Protein Analysis System). Visit
http://www.expasy.ch/
55
Example of how to access sequence data: HIV-1 pol
There are many possible approaches. Begin at the
main page of NCBI, and type an Entrez query:
hiv-1 pol
57
Searching for HIV-1 pol: following the genome
link yields a manageable three results
58
Example of how to access sequence data: HIV-1 pol
For the Entrez query hiv-1 pol there are about
40,000 nucleotide or protein records (and
>100,000 records for a search for hiv-1), but
these can be reduced in two easy steps:
-- specify the organism, e.g. hiv-1[organism]
-- limit the output to RefSeq!
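The two narrowing steps can also be written as a single query string. The sketch below assembles such a term and a URL for NCBI's E-utilities `esearch` endpoint, without sending any request. Several details are assumptions that do not appear in the slides: the E-utilities URL, the `db=nucleotide` choice, and the `srcdb_refseq[prop]` filter as the way to limit results to RefSeq. The helper names `build_term` and `build_esearch_url` are hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not mentioned in the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(keyword, organism=None, refseq_only=False):
    """Combine a keyword with optional organism and RefSeq restrictions."""
    parts = [keyword]
    if organism:
        parts.append(f"{organism}[organism]")  # step 1: specify the organism
    if refseq_only:
        parts.append("srcdb_refseq[prop]")     # step 2: assumed RefSeq filter
    return " AND ".join(parts)


def build_esearch_url(term, db="nucleotide"):
    """Return an ESearch URL for the term; no network request is made."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


term = build_term("pol", organism="hiv-1", refseq_only=True)
url = build_esearch_url(term)
print(term)
print(url)
```

Building the term separately from the URL keeps the query readable and lets the same string be pasted into the Entrez search box by hand.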
59
over 100,000 nucleotide entries for HIV-1
only 1 RefSeq
60
Examples of how to access sequence data: histone
Query for histone            Results
protein records              21847
RefSeq entries               7544
RefSeq (limit to human)      1108
NOT deacetylase              697
At this point, select a reasonable candidate
(e.g. histone 2, H4) and follow its link to
Entrez Gene. There, you can confirm you have the
right gene/protein.
8-12-06
62
Access to Biomedical Literature
63
PubMed at NCBI to find literature information
64
PubMed is the NCBI gateway to MEDLINE. MEDLINE
contains bibliographic citations and author
abstracts from over 4,600 journals published in
the United States and in 70 foreign countries.
It has >14 million records dating back to 1966.
65
MeSH is the acronym for "Medical Subject
Headings." MeSH is the list of the vocabulary
terms used for subject analysis of biomedical
literature at NLM. MeSH vocabulary is used for
indexing journal articles for MEDLINE. The
MeSH controlled vocabulary imposes uniformity
and consistency to the indexing of biomedical
literature.
68
PubMed search strategies
Try the tutorial (education on the left sidebar)
Use boolean queries (capitalize AND, OR, NOT), e.g. lipocalin AND disease
Try using limits
Try Links to find Entrez information and external resources
Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/
Page 35
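As a small sketch of the boolean-query advice, the helper below joins search terms with capitalized operators, so that a query such as lipocalin AND disease is always written in the form the slide recommends. The function name `boolean_query` is hypothetical, and the code only builds the query text; it does not contact PubMed.

```python
ALLOWED_OPERATORS = {"AND", "OR", "NOT"}


def boolean_query(first_term, *clauses):
    """Build a query string from a first term and (operator, term) pairs.

    Operators are upper-cased, following the advice to capitalize
    AND, OR, and NOT.
    """
    query = first_term
    for operator, term in clauses:
        op = operator.upper()
        if op not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {operator!r}")
        query = f"{query} {op} {term}"
    return query


print(boolean_query("lipocalin", ("and", "disease")))
print(boolean_query("histone", ("not", "deacetylase")))
```

The second call mirrors the earlier histone example, where NOT deacetylase was used to trim the result list.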
