Introduction to Bioinformatics

Slides: 139

Provided by: jonathan82

Transcript and Presenter's Notes

Title: Introduction to Bioinformatics

1
Introduction to Bioinformatics
Monday, November 17, 2008
Jonathan Pevsner
pevsner_at_jhmi.edu
Bioinformatics M.E800.707
2
Teaching assistants!
Bethany Drehman bethfoxglove_at_gmail.com
Cheng Ran (Lisa) Huang huangchengran_at_gmail.com
3
Who is taking this course?

People with very diverse backgrounds in biology
Some people with backgrounds in computer
science and biostatistics
Most people (will) have a favorite gene,
protein, or disease

4
What are the goals of the course?

To provide an introduction to bioinformatics with a focus on the National Center for Biotechnology Information (NCBI) and EBI
To focus on the analysis of DNA, RNA and proteins
To introduce you to the analysis of genomes
To combine theory and practice to help you solve research problems

5
Themes throughout the course
Textbooks
Web sites
Literature references
Gene/protein families
Computer labs
6
Textbook
The course has no required textbook. I wrote Bioinformatics and Functional Genomics (Wiley, 2003). The seven lectures in this course correspond closely to chapters. An electronic version is available on the Welch Library website. A few copies will be available on reserve at Welch Library, and the library has six more copies.
I recommend several other bioinformatics texts: Baxevanis and Ouellette; David Mount; Durbin et al.
7
Visit http://www.welch.jhu.edu
Search for "bioinformatics" in ebook titles
8
Visit http://www.welch.jhu.edu
Search for "bioinformatics" in ebook titles
9
Web sites
The course website is reached via moodle: http://pevsnerlab.kennedykrieger.org/moodle (or Google "moodle bioinformatics")
--This site contains the powerpoints for each lecture, including color and black & white versions
--The weekly quizzes are here
--You can ask questions via the forum
The textbook website is http://www.bioinfbook.org
This has powerpoints, URLs, etc. organized by chapter
10
Literature references
You are encouraged to read original source
articles (posted on moodle). They will enhance
your understanding of the material. Readings are
optional but recommended.
11
Themes throughout the course: gene/protein families
We will use beta globin and retinol-binding protein 4 (RBP4) as model genes/proteins throughout the course. Globins including hemoglobin and myoglobin carry oxygen. RBP4 is a member of the lipocalin family. It is a small, abundant carrier protein. We will study globins and lipocalins in a variety of contexts including
--sequence alignment
--gene expression
--protein structure
--phylogeny
--homologs in various species
12
(No Transcript)
13
Computer labs
There are three computer labs. I STRONGLY encourage you to bring a laptop to class. Also, the seven weekly quizzes function as a computer lab: to solve the questions, you may need to go to a website and use databases or software.
14
Grading
60%: moodle quizzes (best six out of seven). Quizzes are taken at the moodle website, and are due one week after the relevant lecture.
40%: final exam, Tuesday, January 12 (in class). Closed book, cumulative, no computer, short answer / multiple choice. Past exams will be made available ahead of time.
15
Google "moodle bioinformatics" to get here
Click "Introduction to Bioinformatics" to sign in
The enrollment key is
16
Outline for the course
1. Accessing information about DNA and proteins: Nov. 17
2. Pairwise alignment: Nov. 24
3. BLAST: Dec. 1
LAB 1 of 3: Dec. 1
4. Multiple sequence alignment: Dec. 8
5. Molecular phylogeny and evolution: Dec. 15
LAB 2 of 3: Dec. 15
6. Proteomics: Dec. 22
7. Gene expression (microarrays): Jan. 5
LAB 3 of 3: Jan. 5
Final exam: Jan. 12
17
Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Five ways to find information on proteins and DNA
Access to biomedical literature
Pairwise alignment introduction
18
What is bioinformatics?

Interface of biology and computers
Analysis of proteins, genes and genomes
using computer algorithms and
computer databases
Genomics is the analysis of genomes.
The tools of bioinformatics are used to make
sense of the billions of base pairs of DNA
that are sequenced by genomics projects.

19
On bioinformatics
Science is about building causal relations
between natural phenomena (for instance, between
a mutation in a gene and a disease). The
development of instruments to increase our
capacity to observe natural phenomena has,
therefore, played a crucial role in the
development of science - the microscope being the
paradigmatic example in biology. With the human
genome, the natural world takes an unprecedented
turn: it is better described as a sequence of
symbols. Besides high-throughput machines such as
sequencers and DNA chip readers, the computer and
the associated software becomes the instrument to
observe it, and the discipline of bioinformatics
flourishes.
20
On bioinformatics
However, as the separation between us (the
observers) and the phenomena observed increases
(from organism to cell to genome, for instance),
instruments may capture phenomena only
indirectly, through the footprints they leave.
Instruments therefore need to be calibrated: the
distance between the reality and the observation
(through the instrument) needs to be accounted
for. This issue of Genome Biology is about
calibrating instruments to observe gene
sequences; more specifically, computer programs
to identify human genes in the sequence of the
human genome.
Martin Reese and Roderic Guigó, Genome Biology 2006, 7(Suppl 1):S1, introducing
EGASP, the Encyclopedia of DNA Elements (ENCODE)
Genome Annotation Assessment Project
21
Tool-users
Tool-makers
22
Three perspectives on bioinformatics
The cell
The organism
The tree of life
Page 4
23
(No Transcript)
24
DNA
RNA
phenotype
protein
Page 5
25
Time of development
Body region, physiology, pharmacology, pathology
Page 5
26
After Pace NR (1997) Science 276:734
Page 6
27
DNA
RNA
phenotype
protein
28
Growth of GenBank
[Figure: base pairs of DNA (billions) and sequences (millions) plotted by year, 1982-2002]
Fig. 2.1 Page 17
29
Growth of GenBank and Whole Genome Shotgun (1982-November 2008)
[Figure: number of sequences in GenBank (millions), base pairs of DNA in GenBank (billions), and base pairs in GenBank WGS (billions), plotted by year, 1982-2007; vertical axis 0-250]
30
Central dogma of molecular biology
DNA
RNA
protein
31
DNA
RNA
phenotype
protein
protein sequence databases
cDNA ESTs UniGene
genomic DNA databases
Fig. 2.2 Page 20
32
There are three major public DNA databases
GenBank
EMBL
DDBJ
The underlying raw DNA sequences are identical
Page 16
33
There are three major public DNA databases
GenBank: housed at NCBI (National Center for Biotechnology Information)
EMBL: housed at EBI (European Bioinformatics Institute)
DDBJ: housed in Japan
Page 16
34
The Trace Archive at NCBI contains over 2 billion
traces
11/08
35
Taxonomy at NCBI: 200,000 species are represented in GenBank
http://www.ncbi.nlm.nih.gov/Taxonomy/txstat.cgi
11/08
36
The most sequenced organisms in GenBank
Homo sapiens: 13.1 billion bases
Mus musculus: 8.4b
Rattus norvegicus: 6.1b
Bos taurus: 5.2b
Zea mays: 4.6b
Sus scrofa: 3.6b
Danio rerio: 3.0b
Oryza sativa (japonica): 1.5b
Strongylocentrotus purpuratus: 1.4b
Nicotiana tabacum: 1.1b
Updated 11-6-08, GenBank release 168.0. Excluding WGS, organelles, metagenomics
Table 2-2 Page 18
37
National Center for Biotechnology Information
(NCBI) www.ncbi.nlm.nih.gov
Page 24
38
Fig. 2.5 Page 25
www.ncbi.nlm.nih.gov
39
Fig. 2.5 Page 25
40

PubMed is
National Library of Medicine's search service
16 million citations in MEDLINE
links to participating online journals
PubMed tutorial (via Education on side bar)

Page 24
41
(No Transcript)
42

Entrez integrates
the scientific literature
DNA and protein sequence databases
3D protein structure data
population study data sets
assemblies of complete genomes

Page 24
43
Entrez is a search and retrieval system that
integrates NCBI databases
Page 24
44

BLAST is
Basic Local Alignment Search Tool
NCBI's sequence similarity search tool
supports analysis of DNA and protein databases
100,000 searches per day

Page 25
45

OMIM is
Online Mendelian Inheritance in Man
catalog of human genes and genetic disorders
created by Dr. Victor McKusick; led by Dr. Ada Hamosh at JHMI

Page 25
46

Books is
searchable resource of on-line books

Page 26
47

TaxBrowser is
browser for the major divisions of living
organisms
(archaea, bacteria, eukaryota, viruses)
taxonomy information such as genetic codes
molecular data on extinct organisms

Page 26
48

Structure site includes
Molecular Modelling Database (MMDB)
biopolymer structures obtained from
the Protein Data Bank (PDB)
Cn3D (a 3D-structure viewer)
vector alignment search tool (VAST)

Page 26
49
Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Five ways to find information on proteins and DNA
Access to biomedical literature
Pairwise alignment introduction
50
Accession numbers are labels for sequences
NCBI includes databases (such as GenBank) that
contain information on DNA, RNA, or protein
sequences. You may want to acquire information
beginning with a query such as the name of a
protein of interest, or the raw nucleotides
comprising a DNA sequence of interest. DNA
sequences and other molecular data are tagged
with accession numbers that are used to identify
a sequence or other record relevant to molecular
data.
Page 26
51
What is an accession number?
An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
X02775: GenBank genomic DNA sequence
NT_030059: Genomic contig
rs7079946: dbSNP (single nucleotide polymorphism)
N91759.1: An expressed sequence tag (1 of 170)
NM_006744: RefSeq DNA sequence (from a transcript)
NP_006735: RefSeq protein
AAC02945: GenBank protein
Q28369: SwissProt protein
1KT7: Protein Data Bank structure record
DNA
RNA
protein
Page 27
52
Five ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
5 UCSC Genome Browser
Page 27
53
5 ways to access protein and DNA sequences
1 Entrez Gene with RefSeq
Entrez Gene is a great starting point: it collects key information on each gene/protein from major databases. It covers all major organisms.
RefSeq provides a curated, optimal accession number for each DNA (NM_006744) or protein (NP_006735)
Page 27
54
From the NCBI home page, type beta globin and
hit Go
revised 11/08 Fig. 2.7 Page 29
55
revised Fig. 2.7 Page 29
56
(No Transcript)
57
(No Transcript)
58
By applying limits, there are now fewer entries
59
Entrez Gene (top of page)
Note that links to many other HBB database
entries are available
revised Fig. 2.8 Page 30
60
Entrez Gene (middle of page)
61
Entrez Gene (middle of page, continued)
62
Entrez Gene (bottom of page) RefSeqs
63
Entrez Gene (bottom of page) non-RefSeq
accessions
64
Fig. 2.9 Page 32
65
Fig. 2.9 Page 32
66
Fig. 2.9 Page 32
67
FASTA format: versatile, compact, with one header line (beginning with ">") followed by a string of nucleotides or amino acids in the single-letter code
Fig. 2.10 Page 32
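A minimal sketch of reading FASTA-formatted text may make the format concrete. It assumes only what the slide states (a ">" header line followed by sequence lines); the function name `parse_fasta` and the record headers are hypothetical, and the two peptides are the oxytocin and vasopressin sequences shown later in this lecture.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical headers, for illustration only
example = """>example_1 oxytocin
CYIQN
CPLG
>example_2 vasopressin
CYFQNCPRG
"""

for header, seq in parse_fasta(example):
    print(header, len(seq), seq)
```

Real files can carry extra conventions (comments, lowercase masking) that this sketch does not handle.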
68
What is an accession number?
An accession number is a label that is used to identify a sequence. It is a string of letters and/or numbers that corresponds to a molecular sequence.
Examples:
X02775: GenBank genomic DNA sequence
NT_030059: Genomic contig
rs7079946: dbSNP (single nucleotide polymorphism)
N91759.1: An expressed sequence tag (1 of hundreds)
NM_006744: RefSeq DNA sequence (from a transcript)
NP_006735: RefSeq protein
AAC02945: GenBank protein
Q28369: SwissProt protein
1KT7: Protein Data Bank structure record
DNA
RNA
protein
Page 27
69
NCBI's important RefSeq project: best representative sequences
RefSeq (accessible via the main page of NCBI) provides an expertly curated accession number that corresponds to the most stable, agreed-upon reference version of a sequence.
RefSeq identifiers include the following formats:
Complete genome: NC_
Complete chromosome: NC_
Genomic contig: NT_
mRNA (DNA format): NM_ (e.g. NM_006744)
Protein: NP_ (e.g. NP_006735)
Page 29-30
70
NCBI's RefSeq project: accession for genomic, mRNA, protein sequences
Accession         Molecule   Method            Note
AC_123456         Genomic    Mixed             Alternate complete genomic
AP_123456         Protein    Mixed             Protein products; alternate
NC_123456         Genomic    Mixed             Complete genomic molecules
NG_123456         Genomic    Mixed             Incomplete genomic regions
NM_123456         mRNA       Mixed             Transcript products; mRNA
NM_123456789      mRNA       Mixed             Transcript products; 9-digit
NP_123456         Protein    Mixed             Protein products
NP_123456789      Protein    Curation          Protein products; 9-digit
NR_123456         RNA        Mixed             Non-coding transcripts
NT_123456         Genomic    Automated         Genomic assemblies
NW_123456         Genomic    Automated         Genomic assemblies
NZ_ABCD12345678   Genomic    Automated         Whole genome shotgun data
XM_123456         mRNA       Automated         Transcript products
XP_123456         Protein    Automated         Protein products
XR_123456         RNA        Automated         Transcript products
YP_123456         Protein    Auto. & Curated   Protein products
ZP_12345678       Protein    Automated         Protein products
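The prefix table above can be turned into a small lookup. This is only a sketch built from the prefixes listed on this slide; the function name `refseq_molecule` is hypothetical, and it does not validate the digits after the underscore.

```python
# Two-letter RefSeq prefixes and molecule types, copied from the slide's table
REFSEQ_MOLECULE = {
    "AC": "Genomic", "AP": "Protein", "NC": "Genomic", "NG": "Genomic",
    "NM": "mRNA", "NP": "Protein", "NR": "RNA", "NT": "Genomic",
    "NW": "Genomic", "NZ": "Genomic", "XM": "mRNA", "XP": "Protein",
    "XR": "RNA", "YP": "Protein", "ZP": "Protein",
}


def refseq_molecule(accession):
    """Return the molecule type for a RefSeq-style accession, or None.

    An accession without an underscore (e.g. a GenBank accession such as
    X02775) is not in RefSeq format, so None is returned.
    """
    prefix, underscore, _rest = accession.partition("_")
    if not underscore:
        return None
    return REFSEQ_MOLECULE.get(prefix.upper())


for acc in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
    print(acc, refseq_molecule(acc))
```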
71
Five ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
5 UCSC Genome Browser
Page 31
72
protein
DNA
RNA
complementary DNA (cDNA)
UniGene
Fig. 2.3 Page 23
73
UniGene: unique genes via ESTs

Find UniGene at NCBI: www.ncbi.nlm.nih.gov/UniGene
UniGene clusters contain many expressed sequence tags (ESTs), which are DNA sequences (typically 500 base pairs in length) corresponding to the mRNA from an expressed gene. ESTs are sequenced from a complementary DNA (cDNA) library.
UniGene data come from many cDNA libraries. Thus, when you look up a gene in UniGene you get information on its abundance and its regional distribution.

Pages 20-21
74
Cluster sizes in UniGene
This is a gene with 1 EST associated; the cluster size is 1
Fig. 2.3 Page 23
75
Cluster sizes in UniGene
This is a gene with 10 ESTs associated; the cluster size is 10
76
Cluster sizes in UniGene (human)
Cluster size (ESTs)    Number of clusters
1                      40,300
2                      18,500
3-4                    18,000
5-8                    13,400
9-16                   8,100
17-32                  5,200
500-1000               1,900
1000-4000              940
4000-16,000            74
16,000-65,000          8
UniGene build 216, 11/08
77
UniGene: unique genes via ESTs
Conclusion: UniGene is a useful tool to look up information about expressed genes. UniGene displays information about the abundance of a transcript (expressed gene), as well as its regional distribution of expression (e.g. brain vs. liver). We will discuss UniGene further on January 5 (gene expression).
Page 31
78
Five ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
5 UCSC Genome Browser
Page 31
79
Ensembl to access protein and DNA sequences
Try Ensembl at www.ensembl.org for a premier human genome web browser. We will encounter Ensembl as we study the human genome, BLAST, and other topics.
80
click human
81
enter RBP4
82
(No Transcript)
83
Five ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
5 UCSC Genome Browser
Page 33
84
ExPASy to access protein and DNA sequences
ExPASy sequence retrieval system (ExPASy: Expert Protein Analysis System)
Visit http://www.expasy.ch/
Page 33
85
Fig. 2.11 Page 33
86
(No Transcript)
87
Five ways to access DNA and protein sequences
1 Entrez Gene with RefSeq
2 UniGene
3 European Bioinformatics Institute (EBI) and Ensembl (separate from NCBI)
4 ExPASy Sequence Retrieval System (separate from NCBI)
5 UCSC Genome Browser
Page 33
88
1 Visit http://genome.ucsc.edu/, click Genome Browser
2 Choose organisms, enter query (beta globin), hit submit
89
Example of how to access sequence data: HIV-1 pol
There are many possible approaches. Begin at the main page of NCBI, and type an Entrez query: hiv-1 pol
Page 34
90
11/08
91
Searching for HIV-1 pol Following the genome
link yields a manageable five results
Page 34
92
Example of how to access sequence data: HIV-1 pol
For the Entrez query hiv-1 pol there are about 80,000 nucleotide or protein records (and >200,000 records for a search for hiv-1), but these can easily be reduced in two easy steps:
--specify the organism, e.g. hiv-1[organism]
--limit the output to RefSeq!
Page 34
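The two narrowing steps above can be sketched as query-string construction. This goes slightly beyond the slides: it assumes NCBI's E-utilities `esearch` endpoint and the `srcdb_refseq[prop]` filter as the programmatic counterpart of "limit to RefSeq", and the helper names `build_term` and `esearch_url` are hypothetical. Nothing is sent over the network; the code only builds the URL.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities esearch (not mentioned on the slides)
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(name, organism=False, refseq_only=False):
    """Build an Entrez search term, optionally narrowed in two steps."""
    # Step 1: restrict the name to the organism field
    term = f"{name}[organism]" if organism else name
    # Step 2: keep only RefSeq records (assumed filter syntax)
    if refseq_only:
        term += " AND srcdb_refseq[prop]"
    return term


def esearch_url(term, db="nucleotide"):
    """Return the URL that would run the search; no request is made here."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


broad = build_term("hiv-1")
narrow = build_term("hiv-1", organism=True, refseq_only=True)
print(broad)
print(narrow)
print(esearch_url(narrow))
```

The same terms can be typed directly into the Entrez search box, which is the approach the lecture describes.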
93
over 200,000 nucleotide entries for HIV-1
only 1 RefSeq
94
Examples of how to access sequence data: histone
Query for "histone"          Results
protein records              21847
RefSeq entries               7544
RefSeq (limit to human)      1108
NOT deacetylase              697
At this point, select a reasonable candidate (e.g. histone 2, H4) and follow its link to Entrez Gene. There, you can confirm you have the right gene/protein.
8-12-06
95
(No Transcript)
96
Outline for today
Definition of bioinformatics
Overview of the NCBI website
Accessing information about DNA and proteins
--Definition of an accession number
--Five ways to find information on proteins and DNA
Access to biomedical literature
Pairwise alignment introduction
97
PubMed at NCBI to find literature information
98
PubMed is the NCBI gateway to MEDLINE. MEDLINE contains bibliographic citations and author abstracts from over 4,600 journals published in the United States and in 70 foreign countries. It has >18 million records dating back to the 1950s.
Updated 11-08
Page 35
99
MeSH is the acronym for "Medical Subject
Headings." MeSH is the list of the vocabulary
terms used for subject analysis of biomedical
literature at NLM. MeSH vocabulary is used for
indexing journal articles for MEDLINE. The
MeSH controlled vocabulary imposes uniformity
and consistency to the indexing of biomedical
literature.
Page 35
100
(No Transcript)
101
(No Transcript)
102
PubMed search strategies
Try the tutorial (education on the left sidebar)
Use boolean queries (capitalize AND, OR, NOT): lipocalin AND disease
Try using limits
Try Links to find Entrez information and external resources
Obtain articles on-line via Welch Medical Library (and download pdf files): http://www.welch.jhu.edu/
Page 35
103
lipocalin AND disease (60 results): 1 AND 2
lipocalin OR disease (1,650,000 results): 1 OR 2
lipocalin NOT disease (530 results): 1 NOT 2
[Venn diagrams: circle 1 = lipocalin, circle 2 = disease]
Fig. 2.12 Page 34
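The three boolean operators behave like set operations on the articles matching each term. A small sketch with made-up article IDs (the numbers are hypothetical, not real PubMed identifiers):

```python
# Hypothetical article IDs matching each search term
lipocalin_hits = {101, 102, 103, 104}
disease_hits = {103, 104, 105, 106, 107}

and_hits = lipocalin_hits & disease_hits   # lipocalin AND disease: in both sets
or_hits = lipocalin_hits | disease_hits    # lipocalin OR disease: in either set
not_hits = lipocalin_hits - disease_hits   # lipocalin NOT disease: first set only

print("AND:", sorted(and_hits))
print("OR: ", sorted(or_hits))
print("NOT:", sorted(not_hits))
```

AND can only shrink a result list and OR can only grow it, which matches the pattern in the result counts on the slide (60, 1,650,000 and 530).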
104
                        Article contents
Search result           globin is present                            globin is absent
globin is found         true positive                                false positive (article does not discuss globins)
globin is not found     false negative (article discusses globins)   true negative
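The four outcomes in the table can be expressed as a function of two yes/no facts: whether the search returned the article, and whether the article really discusses globins. The function name `search_outcome` and the example articles are hypothetical.

```python
from collections import Counter


def search_outcome(found, present):
    """Classify one article given the search result and its true contents."""
    if found:
        return "true positive" if present else "false positive"
    return "false negative" if present else "true negative"


# Hypothetical articles: (returned by the search?, actually discusses globins?)
articles = [(True, True), (True, False), (False, True), (False, False), (True, True)]

counts = Counter(search_outcome(found, present) for found, present in articles)
for label, n in sorted(counts.items()):
    print(label, n)
```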
105
WelchWeb is available at http://www.welch.jhu.edu
106
WelchWeb is available at http://www.welch.jhu.edu
Welch Medical Library liaisons to the basic sciences
107
November 24, 2008
Pairwise sequence alignment
Jonathan Pevsner, Ph.D.
Bioinformatics
Johns Hopkins M.E440.707
108
Outline: pairwise alignment

Overview and examples
Definitions: homologs, paralogs, orthologs
Assigning scores to aligned amino acids: Dayhoff's PAM matrices
Alignment algorithms: Needleman-Wunsch, Smith-Waterman
Statistical significance of pairwise alignments

109
Pairwise alignments in the 1950s
β-corticotropin (sheep): ala gly glu asp asp glu
Corticotropin A (pig): asp gly ala glu asp glu
Oxytocin: CYIQNCPLG
Vasopressin: CYFQNCPRG
110
Early example of sequence alignment: globins (1961)
[Figure labels: myoglobin; α- and β-globins]
H.C. Watson and J.C. Kendrew, Comparison Between the Amino-Acid Sequences of Sperm Whale Myoglobin and of Human Hæmoglobin. Nature 190:670-672, 1961.
111
Pairwise sequence alignment is the most
fundamental operation of bioinformatics

It is used to decide if two proteins (or genes)
are related structurally or functionally
It is used to identify domains or motifs that
are shared between proteins
It is the basis of BLAST searching (next week)
It is used in the analysis of genomes

112
(No Transcript)
113
(No Transcript)
114
Pairwise alignment: protein sequences can be more informative than DNA

protein is more informative (20 vs 4 characters)
many amino acids share related biophysical properties
codons are degenerate: changes in the third position often do not alter the amino acid that is specified
protein sequences offer a longer look-back time
DNA sequences can be translated into protein, and then used in pairwise alignments

115
Page 54
116
Pairwise alignment: protein sequences can be more informative than DNA
DNA can be translated into six potential proteins
5' CAT CAA
5' ATC AAC
5' TCA ACT
5' CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC 3'
3' GTAGTTGATGTTGAGGTTTCTGTGGGAATGTGTAGTTGTTTGGATGGGTG 5'
5' GTG GGT
5' TGG GTA
5' GGG TAG
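The six reading frames on this slide (three offsets on each strand) can be reproduced with a short sketch. It assumes the standard genetic code and uppercase A, C, G, T input with no ambiguity codes; the function names are hypothetical.

```python
# Standard genetic code, codons ordered with bases T, C, A, G ("*" = stop)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; trailing bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the six frames: offsets 0, 1, 2 on each strand."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [strand[offset:] for strand in strands for offset in range(3)]


seq = "CATCAACTACAACTCCAAAGACACCCTTACACATCAACAAACCTACCCAC"  # from the slide
for frame in six_frames(seq):
    print(frame[:3], frame[3:6], "->", translate(frame))
```

The first two codons printed for each frame match the slide: CAT CAA, ATC AAC, TCA ACT on the top strand and GTG GGT, TGG GTA, GGG TAG on the bottom strand.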
117
Pairwise alignment: protein sequences can be more informative than DNA

Many times, DNA alignments are appropriate
--to confirm the identity of a cDNA
--to study noncoding regions of DNA
--to study DNA polymorphisms
--example: Neanderthal vs modern human DNA

Query 181 catcaactacaactccaaagacacccttacacccactaggatatcaacaaacctacccac 240
Sbjct 189 catcaactgcaaccccaaagccacccct-cacccactaggatatcaacaaacctacccac 247
118
β-lactoglobulin (P02754)
retinol-binding protein 4 (NP_006735)
Page 42
119
Outline: pairwise alignment

Overview and examples
Definitions: homologs, paralogs, orthologs
Assigning scores to aligned amino acids: Dayhoff's PAM matrices
Alignment algorithms: Needleman-Wunsch, Smith-Waterman
Statistical significance of pairwise alignments

120
Definitions
Pairwise alignment: The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
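To make "lining up two sequences to achieve maximal levels of identity" concrete, here is a minimal sketch of global alignment in the style of Needleman-Wunsch, one of the algorithms named in the outline. The scoring values (match +1, mismatch -1, gap -2) are arbitrary assumptions chosen for illustration, not the PAM-based scores the course covers, and the function name is hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment by dynamic programming.

    Returns (score, aligned_a, aligned_b). Uses a simple linear gap cost.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + pair,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one best alignment
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        pair = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


# Oxytocin vs vasopressin, the nonapeptides from the 1950s slide
s, x, y = needleman_wunsch("CYIQNCPLG", "CYFQNCPRG")
print(s)
print(x)
print(y)
```

With these assumed scores the best alignment of the two peptides has no gaps: seven matches and two mismatches.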
121
Definitions
Homology: Similarity attributed to descent from a common ancestor.
Page 42
122
Definitions
Homology: Similarity attributed to descent from a common ancestor.
Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.
RBP        26 RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVA 59
                K         GTW  MA        L     A
glycodelin 23 QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKA 55
Page 44
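Identity can be counted directly from two already-aligned sequences. A minimal sketch, assuming the inputs have equal length and use "-" or "." for gaps; the function name `count_identities` is hypothetical. The examples are the RBP/glycodelin fragment above and the oxytocin/vasopressin pair from the 1950s slide.

```python
def count_identities(top, bottom, gap_chars="-."):
    """Count aligned columns where both rows hold the same residue."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    return sum(1 for x, y in zip(top, bottom) if x == y and x not in gap_chars)


rbp = "RVKENFDKARFSGTWYAMAKKDPEGLFLQDNIVA"
glycodelin = "QTKQDLELPKLAGTWHSMAMA-TNNISLMATLKA"
matches = count_identities(rbp, glycodelin)
print(matches, "identical positions out of", len(rbp), "aligned columns")

oxytocin, vasopressin = "CYIQNCPLG", "CYFQNCPRG"
print(count_identities(oxytocin, vasopressin), "of", len(oxytocin))
```

Percent identity depends on the denominator chosen (aligned columns, or the shorter sequence), so it helps to state which one is used when reporting it.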
123
Definitions: two types of homology
Orthologs: Homologous sequences in different species that arose from a common ancestral gene during speciation; may or may not be responsible for a similar function.
Paralogs: Homologous sequences within a single species that arose by gene duplication.
Page 43
124
Orthologs: members of a gene (protein) family in various organisms. This tree shows RBP orthologs.
[Tree figure labels: common carp, zebrafish, rainbow trout, teleost, African clawed frog, chicken, human, mouse, rat, horse, rabbit, cow, pig. Scale bar: 10 changes]
Page 43
125
Paralogs: members of a gene (protein) family within a species
[Tree figure labels: apolipoprotein D, retinol-binding protein 4, complement component 8, alpha-1 microglobulin/bikunin, prostaglandin D2 synthase, progestagen-associated endometrial protein, neutrophil gelatinase-associated lipocalin, odorant-binding protein 2A, lipocalin 1. Scale bar: 10 changes]
Page 44
126
(No Transcript)
127
Pairwise alignment of retinol-binding protein 4 and β-lactoglobulin
  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50  RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44  lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97  RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93  lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Page 46
128
Definitions
Similarity: The extent to which nucleotide or protein sequences are related. It is based upon identity plus conservation.
Identity: The extent to which two sequences are invariant.
Conservation: Changes at a specific position of an amino acid or (less commonly, DNA) sequence that preserve the physico-chemical properties of the original residue.
129
Pairwise alignment of retinol-binding protein and β-lactoglobulin
  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50  RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44  lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97  RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93  lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Identity (bar)
Page 46
130
Pairwise alignment of retinol-binding protein and β-lactoglobulin
  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50  RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44  lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97  RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93  lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Very similar (two dots)
Somewhat similar (one dot)
Page 46
131
Definitions
Pairwise alignment: The process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) for the purpose of assessing the degree of similarity and the possibility of homology.
Page 47
132
Pairwise alignment of retinol-binding protein and β-lactoglobulin
  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50  RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44  lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97  RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93  lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
Internal gap
Terminal gap
Page 46
133
Gaps
Positions at which a letter is paired with a
null are called gaps. Gap scores are
typically negative. Since a single mutational
event may cause the insertion or deletion of
more than one residue, the presence of a gap
is ascribed more significance than the length
of the gap. In BLAST, it is rarely necessary
to change gap values from the default.
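The idea that the presence of a gap matters more than its length is usually implemented as an affine gap penalty: a large cost for opening a gap plus a small cost per gapped position. A sketch follows, assuming a penalty of the form -(open + extend x length) with illustrative values of 11 and 1; these numbers are assumptions for the example, not values taken from the slides.

```python
import re


def gap_lengths(aligned, gap_chars=".-"):
    """Return the length of each run of gap characters in an aligned row."""
    pattern = "[" + re.escape(gap_chars) + "]+"
    return [len(m.group()) for m in re.finditer(pattern, aligned)]


def affine_gap_penalty(length, gap_open=11, gap_extend=1):
    """Penalty for one gap: a fixed opening cost plus a per-position cost."""
    return -(gap_open + gap_extend * length)


# Third RBP row of the RBP / lactoglobulin alignment in this lecture
rbp_row = "DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC"
runs = gap_lengths(rbp_row)
print(runs)

one_long_gap = affine_gap_penalty(11)
eleven_short_gaps = 11 * affine_gap_penalty(1)
print(one_long_gap, eleven_short_gaps)
```

Under these assumed values a single gap of 11 positions costs far less than eleven separate one-position gaps, which reflects the reasoning that one mutational event can insert or delete several residues at once.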
134
Pairwise alignment of retinol-binding protein and β-lactoglobulin
  1 MKWVWALLLLAAWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDPEG 50  RBP
  1 ...MKCLLLALALTCGAQALIVT..QTMKGLDIQKVAGTWYSLAMAASD. 44  lactoglobulin

 51 LFLQDNIVAEFSVDETGQMSATAKGRVR.LLNNWD..VCADMVGTFTDTE 97  RBP
 45 ISLLDAQSAPLRV.YVEELKPTPEGDLEILLQKWENGECAQKKIIAEKTK 93  lactoglobulin

 98 DPAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAV...........QYSC 136 RBP
 94 IPAVFKIDALNENKVL........VLDTDYKKYLLFCMENSAEPEQSLAC 135 lactoglobulin

137 RLLNLDGTCADSYSFVFSRDPNGLPPEAQKIVRQRQ.EELCLARQYRLIV 185 RBP
136 QCLVRTPEVDDEALEKFDKALKALPMHIRLSFNPTQLEEQCHI....... 178 lactoglobulin
135
Pairwise alignment of retinol-binding protein from human (top) and rainbow trout (O. mykiss)
  1 .MKWVWALLLLA.AWAAAERDCRVSSFRVKENFDKARFSGTWYAMAKKDP 48
  1 MLRICVALCALATCWA...QDCQVSNIQVMQNFDRSRYTGRWYAVAKKDP 47

 49 EGLFLQDNIVAEFSVDETGQMSATAKGRVRLLNNWDVCADMVGTFTDTED 98
 48 VGLFLLDNVVAQFSVDESGKMTATAHGRVIILNNWEMCANMFGTFEDTPD 97

 99 PAKFKMKYWGVASFLQKGNDDHWIVDTDYDTYAVQYSCRLLNLDGTCADS 148
 98 PAKFKMRYWGAASYLQTGNDDHWVIDTDYDNYAIHYSCREVDLDGTCLDG 147

149 YSFVFSRDPNGLPPEAQKIVRQRQEELCLARQYRLIVHNGYCDGRSERNLL 199
148 YSFIFSRHPTGLRPEDQKIVTDKKKEICFLGKYRRVGHTGFCESS...... 192
136
Pairwise sequence alignment allows us to look back billions of years ago (BYA)
[Timeline figure, 4 to 0 BYA. Labels: origin of life; origin of eukaryotes; earliest fossils; eukaryote/archaea; fungi/animal; plant/animal; insects]
Page 48
137
Multiple sequence alignment of glyceraldehyde 3-phosphate dehydrogenases
fly        GAKKVIISAP SAD.APM..F VCGVNLDAYK PDMKVVSNAS CTTNCLAPLA
human      GAKRVIISAP SAD.APM..F VMGVNHEKYD NSLKIISNAS CTTNCLAPLA
plant      GAKKVIISAP SAD.APM..F VVGVNEHTYQ PNMDIVSNAS CTTNCLAPLA
bacterium  GAKKVVMTGP SKDNTPM..F VKGANFDKY. AGQDIVSNAS CTTNCLAPLA
yeast      GAKKVVITAP SS.TAPM..F VMGVNEEKYT SDLKIVSNAS CTTNCLAPLA
archaeon   GADKVLISAP PKGDEPVKQL VYGVNHDEYD GE.DVVSNAS CTTNSITPVA

fly        KVINDNFEIV EGLMTTVHAT TATQKTVDGP SGKLWRDGRG AAQNIIPAST
human      KVIHDNFGIV EGLMTTVHAI TATQKTVDGP SGKLWRDGRG ALQNIIPAST
plant      KVVHEEFGIL EGLMTTVHAT TATQKTVDGP SMKDWRGGRG ASQNIIPSST
bacterium  KVINDNFGII EGLMTTVHAT TATQKTVDGP SHKDWRGGRG ASQNIIPSST
yeast      KVINDAFGIE EGLMTTVHSL TATQKTVDGP SHKDWRGGRT ASGNIIPSST
archaeon   KVLDEEFGIN AGQLTTVHAY TGSQNLMDGP NGKP.RRRRA AAENIIPTST

fly        GAAKAVGKVI PALNGKLTGM AFRVPTPNVS VVDLTVRLGK GASYDEIKAK
human      GAAKAVGKVI PELNGKLTGM AFRVPTANVS VVDLTCRLEK PAKYDDIKKV
plant      GAAKAVGKVL PELNGKLTGM AFRVPTSNVS VVDLTCRLEK GASYEDVKAA
bacterium  GAAKAVGKVL PELNGKLTGM AFRVPTPNVS VVDLTVRLEK AATYEQIKAA
yeast      GAAKAVGKVL PELQGKLTGM AFRVPTVDVS VVDLTVKLNK ETTYDEIKKV
archaeon   GAAQAATEVL PELEGKLDGM AIRVPVPNGS ITEFVVDLDD DVTESDVNAA
Page 49
138
Multiple sequence alignment of human lipocalin paralogs
EIQDVSGTWYAMTVDREFPEMNLESVTPMTLTTL.GGNLEAKVTM       lipocalin 1
LSFTLEEEDITGTWYAMVVDKDFPEDRRRKVSPVKVTALGGGNLEATFTF  odorant-binding protein 2a
TKQDLELPKLAGTWHSMAMATNNISLMATLKAPLRVHITSEDNLEIVLHR  progestagen-assoc. endo.
VQENFDVNKYLGRWYEIEKIPTTFENGRCIQANYSLMENGNQELRADGTV  apolipoprotein D
VKENFDKARFSGTWYAMAKDPEGLFLQDNIVAEFSVDETGNWDVCADGTF  retinol-binding protein
LQQNFQDNQFQGKWYVVGLAGNAI.LREDKDPQKMYATIDKSYNVTSVLF  neutrophil gelatinase-ass.
VQPNFQQDKFLGRWFSAGLASNSSWLREKKAALSMCKSVDGGLNLTSTFL  prostaglandin D2 synthase
VQENFNISRIYGKWYNLAIGSTCPWMDRMTVSTLVLGEGEAEISMTSTRW  alpha-1-microglobulin
PKANFDAQQFAGTWLLVAVGSACRFLQRAEATTLHVAPQGSTFRKLD...  complement component 8
Page 49
