An Example - PowerPoint PPT Presentation

About This Presentation
Title:

An Example

Slides: 76
Provided by: gregsc2
Category:
Tags: ecoli | example

Transcript and Presenter's Notes

Title: An Example


1
Computing in Molecular Biology
Hugues Sicotte, National Center for Biotechnology
Information, sicotte_at_ncbi.nlm.nih.gov
2
C O M P A R A T I V E A N A L Y S I S
Sequence alignment is similar to other types of
comparative analysis
Involves scoring similarities and differences
among a group of related entities
3
Homology
"Homology is the central concept for all of
biology. Whenever we say that a mammalian hormone
is the same hormone as a fish hormone, that a
human gene sequence is the same as a sequence
in a chimp or a mouse, that a HOX gene is the
same in a mouse, a fruit fly, a frog and a
human - even when we argue that discoveries about
a worm, a fruit fly, a frog, a mouse, or a chimp
have relevance to the human condition - we have
made a bold and direct statement about homology.
The aggressive confidence of modern biomedical
science implies that we know what we are talking
about." (David B. Wake)
4
C O M P A R A T I V E A N A L Y S I S
GATTACCA
Alignment algorithms model evolutionary processes
Derivation from a common ancestor through
incremental change due to DNA replication errors,
mutations, damage, or unequal crossing-over.
Insertion
Deletion
Substitution
5
C O M P A R A T I V E A N A L Y S I S
GATTACCA
Alignment algorithms model evolutionary processes
GATGACCA
GATTACCA
GATTACCA
GATTATCA
GATTACCA
GATTACCA
Derivation from a common ancestor through
incremental change
GATCATCA
GATTGATCA
GATACCA
GATCATCA
GATTGATCA
GATACCA
Only extant sequences are known; ancestral
sequences are postulated.
6
C O M P A R A T I V E A N A L Y S I S
GATTACCA
Alignment algorithms model evolutionary processes
GATGACCA
GATTACCA
GATTACCA
GATTATCA
GATTACCA
Derivation from a common ancestor through
incremental change. Mutations that do not kill
the host may carry over to the population. Rarely
are mutations kept/rejected by natural selection.
GATCATCA
GATTGATCA
GATACCA
The term homology implies a common ancestry,
which may be inferred from observations of
sequence similarity
7
Comparative Analysis of Genes
Align Extant Sequences
MSH2_Human  TGVIVLMAQIGCFVPCESAEVSIVDCILARVGAGDSQLKGVSTFMAEMLETASILRSATK
SPE1_DROME  VGTAVLMAHIGAFVPCSLATISMVDSILGRVGASDNIIKGLSTFMVEMIETSGIIRTATD
MSH2_Yeast  VGVISLMAQIGCFVPCEEAEIAIVDAILCRVGAGDSQLKGVSTFMVEILETASILKNASK
MUTS_ECOLI  TALIALMAYIGSYVPAQKVEIGPIDRIFTRVGAADDLASGRSTFMVEMTETANILRNATE

Human Colon Cancer MSH2 gene is homologous to DNA
repair proteins
8
Why align sequences?
- Finding similar sequences helps determine the
  properties and function of a new sequence.
  (Must be verified experimentally.)
- Conserved positions in homologous sequences hint
  at functionally important sites in proteins
  (active or catalytic sites, DNA binding domains,
  disulfide bridges, structural bends, hydrophobic
  pockets, protein binding domains).
- Conserved nucleotides can hint at regulatory
  elements, either pre-transcriptional or
  post-transcriptional.
9
Sound alignment methods reflect evolution.
DNA Evolution
- Mutation: errors in DNA replication or DNA repair.
  - Substitutions: replacement of one base by another.
  - Deletions/insertions: by DNA mispairing during
    replication or unequal crossing-over.
- Gene conversion or unequal crossing-over: large
  segments of DNA can be inserted/deleted.
- Mutations that do not kill the host are
  propagated. Sometimes positive mutations are
  selected for.
Reference: Molecular Evolution, Wen-Hsiung Li,
1997, Sinauer Associates.
10
Synonymous versus non-synonymous mutations
Different regions evolve at different rates,
consistent with evolutionary constraints.
Substitution rate per nucleotide site per billion
years.
11
Alignment definition and Type
Alignment
Each base is used at most once.
Global Alignment
All bases are aligned with another base or with
a gap (symbol "-" or sometimes ".").
G-ATES
GRATED

Local Alignments
Do not need to align all the bases in all
sequences.
Align BILLGATESLIKESCHEESE and GRATEDCHEESE:
G-ATESLIKESCHEESE        G-ATES   CHEESE
GRATED-----CHEESE   or   GRATED   CHEESE
12
C O M P A R A T I V E A N A L Y S I S
Insertions and deletions (indels) are
represented by gaps in alignments
GATTATACCA
GATTA---CA
gap of length 3
13
S E Q U E N C E A L I G N M E N T
Alignment of trypsin sequences from mouse and
crayfish
An alignment provides a mapping of residues in
one sequence onto those of another
S-S

Mouse     IVGGYNCEENSVPYQVSLNS-----GYHFCGGSLINEQWVVSAGHCYK-------SRIQV
Crayfish  IVGGTDAVLGEFPYQLSFQETFLGFSFHFCGASIYNENYAITAGHCVYGDDYENPSGLQI

Mouse     RLGEHNIEVLEGNEQFINAAKIIRHPQYDRKTLNNDIMLIKLSSRAVINARVSTISLPTA
Crayfish  VAGELDMSVNEGSEQTITVSKIILHENFDYDLLDNDISLLKLSGSLTFNNNVAPIALPAQ

Mouse     PPATGTKCLISGWGNTASSGADYPDELQCLDAPVLSQAKCEASYPG-KITSNMFCVGFLE
Crayfish  GHTATGNVIVTGWG-TTSEGGNTPDVLQKVTVPLVSDAECRDDYGADEIFDSMICAGVPE

Mouse     GGKDSCQGDSGGPVVCNG----QLQGVVSWGDGCAQKNKPGVYTKVYNYVKWIKNTIAAN
Crayfish  GGKDSCQGDSGGPLAASDTGSTYLAGIVSWGYGCARPGYPGVYTEVSYHVDWIKANAV--
S-S
Conserved residues are often of structural or
functional importance
S-S
Figure 7.1
14
S E Q U E N C E A L I G N M E N T
Alignment of trypsin sequences from mouse and
crayfish
Conserved positions are often of functional
importance. Alignment of trypsin proteins of
mouse (Swiss-Prot P07146) and crayfish
(Swiss-Prot P00765). Identical residues are
highlighted red and underlined. Indicated above
the alignment are three disulfide bonds (-S-S-)
whose participating cysteine residues are
conserved, amino acids whose side chains are
involved in the charge relay system (asterisk)
and the active site residue which governs
substrate specificity (diamond). The other
conserved positions have no known role. These
conserved residues could be coincidentally
conserved or have some unknown structural role.
Figure 7.1
15
S E Q U E N C E A L I G N M E N T
Human zeta crystallin vs E.coli quinone
oxidoreductase
Stars indicate identical residues and dots
indicate conservative substitutions
CLUSTAL W (1.7) multiple sequence alignment

Human-Zcr  MATGQKLMRAVRVFEFGGPEVLKLRSDIAVPIPKDHQVLIKVHACGVNPVETYIRSGTYS
Ecoli-QOR  ------MATRIEFHKHGGPEVLQA-VEFTPADPAENEIQVENKAIGINFIDTYIRSGLYP

Human-Zcr  RKPLLPYTPGSDVAGVIEAVGDNASAFKKGDRVFTSSTISGGYAEYALAADHTVYKLPEK
Ecoli-QOR  -PPSLPSGLGTEAAGIVSKVGSGVKHIKAGDRVVYAQSALGAYSSVHNIIADKAAILPAA

Human-Zcr  LDFKQGAAIGIPYFTAYRALIHSACVKAGESVLVHGASGGVGLAACQIARAYGLKILGTA
Ecoli-QOR  ISFEQAAASFLKGLTVYYLLRKTYEIKPDEQFLFHAAAGGVGLIACQWAKALGAKLIGTV

Human-Zcr  GTEEGQKIVLQNGAHEVFNHREVNYIDKIKKYVGEKGIDIIIEMLANVNLSKDLSLLSHG
Ecoli-QOR  GTAQKAQSALKAGAWQVINYREEDLVERLKEITGGKKVRVVYDSVGRDTWERSLDCLQRR

Human-Zcr  GRVIVVG-SRGTIEINPRDTMAKES----SIIGVTLFSSTKEEFQQYAAALQAGMEIGWL
Ecoli-QOR  GLMVSFGNSSGAVTGVNLGILNQKGSLYVTRPSLQGYITTREELTEASNELFSLIASGVI

Human-Zcr  KPVIGSQ--YPLEKVAEAHENIIHGSGATGKMILLL
Ecoli-QOR  KVDVAEQQKYPLKDAQRAHE-ILESRATQGSSLLIP
Figure 7.2
16
Score and Statistics
Percent identity: can be misleading.
Score: a simple quality measure. The
score assigns points for each aligned base
(or gap) of the alignment:
  identical bases: match score
  mismatching bases: mismatch score
  gaps: gap opening penalty for starting a gap,
        plus a gap extension penalty for each gap symbol
Example: match +1, mismatch -1, gap opening -5, gap extension -1
G-ATESLIKESCHEESE            G-ATES   CHEESE
GRATED-----CHEESE   and/or   GRATED   CHEESE
Score (first alignment) = 10(1) + 1(-1) + (-5 - 1) + (-5 + 5(-1)) = -7
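The scoring rule above can be sketched in a few lines of Python. This is only an illustration under the slide's example parameters; the function name `alignment_score` is hypothetical, and the convention that every gap symbol (including the first) pays the extension penalty is taken from the worked arithmetic on the slide.

```python
def alignment_score(top, bottom, match=1, mismatch=-1, gap_open=-5, gap_extend=-1):
    """Score two equal-length gapped rows column by column (affine gaps)."""
    assert len(top) == len(bottom)
    score = 0
    prev_gap = None  # which row held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            row = "top" if a == "-" else "bottom"
            if row != prev_gap:
                score += gap_open      # a new gap starts in this row
            score += gap_extend        # every gap symbol pays the extension penalty
            prev_gap = row
        else:
            score += match if a == b else mismatch
            prev_gap = None
    return score


print(alignment_score("G-ATESLIKESCHEESE", "GRATED-----CHEESE"))  # -7
```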
17
S C O R I N G S Y S T E M S
Which alignment is better?
GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC
0 mismatches, 5 gaps

GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC
3 mismatches, 1 gap
18
S C O R I N G S Y S T E M S
High penalty for opening a gap (e.g. G = 5)
Lower penalty for extending a gap (e.g. L = 1)
GCTACTAG-T-T--CGC-T-TAGC
GCTACTAGCTCTAGCGCGTATAGC
Penalty = 5G + 6L = 31
GCTACTAGTT------CGCTTAGC
GCTACTAGCTCTAGCGCGTATAGC
Penalty = 1G + 6L = 11
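As a rough check of the two penalties, the gap runs in each gapped row can be counted directly. The helper name `gap_penalty` is an assumption for illustration; G and L default to the slide's example values.

```python
from itertools import groupby


def gap_penalty(aligned_row, G=5, L=1):
    """Return (number of gaps, number of gap positions, total penalty)."""
    # Each run of consecutive "-" symbols counts as one gap.
    runs = [len(list(group))
            for is_gap, group in groupby(aligned_row, key=lambda c: c == "-")
            if is_gap]
    n_gaps, n_positions = len(runs), sum(runs)
    return n_gaps, n_positions, n_gaps * G + n_positions * L


print(gap_penalty("GCTACTAG-T-T--CGC-T-TAGC"))  # (5, 6, 31)
print(gap_penalty("GCTACTAGTT------CGCTTAGC"))  # (1, 6, 11)
```

With a high opening penalty, the single long gap is preferred even though that alignment contains three mismatches.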
19
L O C A L S I M I L A R I T Y
Protein modules in coagulation factor XII (F12)
and tissue plasminogen activator (PLAT)
Mix-and-match protein modules confound alignment
algorithms
F1, F2: Fibronectin repeats
E: EGF similarity domain
K: Kringle domain
Catalytic: Serine protease activity
Figure 7.3
20
L O C A L S I M I L A R I T Y
Protein modules in coagulation factor XII (F12)
and tissue plasminogen activator (PLAT)
Mix-and-match protein modules confound alignment
algorithms
Modules in reverse order
F1, F2: Fibronectin repeats
E: EGF similarity domain
K: Kringle domain
Catalytic: Serine protease activity
Figure 7.3
21
L O C A L S I M I L A R I T Y
Protein modules in coagulation factor XII (F12)
and tissue plasminogen activator (PLAT)
Mix-and-match protein modules confound alignment
algorithms
Repeated modules
F1, F2: Fibronectin repeats
E: EGF similarity domain
K: Kringle domain
Catalytic: Serine protease activity
Figure 7.3
22
D O T P L O T S
Dot plot: Fitch, Biochem. Genet. (1969) 3, 99-108
Horizontal axis is coordinates for one sequence
Vertical axis is coordinates for the other
Figure 7.4
23
D O T P L O T S
Dot plot: Fitch, Biochem. Genet. (1969) 3, 99-108
Can also score not 1 position at a
time, but in a sliding window. For example, a window
of 3 nucleotides, where we score 1 for identical
triplets and 0 for all other combinations, yields:
Horizontal axis is coordinates for one sequence
Vertical axis is coordinates for the other
    C G T A C C G T
A   0 0 0 0 0 0
C   1 0 0 0 0 1
G
T
Figure 7.4b
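A minimal sketch of the windowed dot plot, assuming exact-match scoring of whole windows as on the slide. The function name `dot_plot` is illustrative only.

```python
def dot_plot(horizontal, vertical, window=3):
    """Return a matrix with 1 where the two windows are identical, else 0."""
    rows = []
    for i in range(len(vertical) - window + 1):
        word = vertical[i:i + window]
        rows.append([1 if horizontal[j:j + window] == word else 0
                     for j in range(len(horizontal) - window + 1)])
    return rows


for row in dot_plot("CGTACCGT", "ACGT"):
    print(row)
# [0, 0, 0, 0, 0, 0]
# [1, 0, 0, 0, 0, 1]
```

The second row corresponds to the triplet CGT, which occurs at both ends of the horizontal sequence.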
24
D O T P L O T S
Coagulation Factor XII (F12)
Horizontal axis is coordinates for one sequence
Vertical axis is coordinates for the other
Tissue Plasminogen Activator (PLAT)
Figure 7.4
25
D O T P L O T S
Coagulation Factor XII (F12)
Plot dots for high similarity within a short
window
Adjacent dots merge to form diagonal segments
Tissue Plasminogen Activator (PLAT)
Figure 7.4
26
D O T P L O T S
Coagulation Factor XII (F12)
Repeated domains show a characteristic pattern
[Figure 7.4 axis labels. PLAT modules: F1, E, K, K, Catalytic. F12 modules: F2, E, F1, E, K, Catalytic.]
Figure 7.4
27
P A T H G R A P H S
Dot plots suggest paths through the alignment
space
EGF similarity domains of urokinase plasminogen
activator (PLAU) and tissue plasminogen
activator (PLAT)
Path graphs are more explicit representations
Each path is a unique alignment
PLAU  90 EPKKVKDHCSKHSPCQKGGTCVNMP--SGPH-CLCPQHLTGNHCQKEK---CFE 137
PLAT  23 ELHQVPSNCD----CLNGGTCVSNKYFSNIHWCNCPKKFGGQHCEIDKSKTCYE  72
Figure 7.5
28
P A T H G R A P H S
EGF similarity domains of urokinase plasminogen
activator (PLAU) and tissue plasminogen
activator (PLAT)
Dot plots suggest paths through the alignment
space
Path graphs are explicit representations of
alignments
PLAU  90 EPKKVKDHCSKHSPCQKGGTCVNMP--SGPH-CLCPQHLTGNHCQKEK---CFE 137
PLAT  23 ELHQVPSNCD----CLNGGTCVSNKYFSNIHWCNCPKKFGGQHCEIDKSKTCYE  72
Figure 7.5
29
P A T H G R A P H S
Best-path problems are common in computer science
A best-path algorithm used for sequence alignment
is called dynamic programming
30
D Y N A M I C P R O G R A M M I N G
Dynamic Programming Example
Construct an optimal alignment of these two sequences:
G A T A C T A
G A T T A C C A
Using these scoring rules:
Match +1
Mismatch -1
Gap -1
31
D Y N A M I C P R O G R A M M I N G
Arrange the sequence residues along a
two-dimensional lattice
(G A T A C T A along the horizontal axis, G A T T A C C A along the vertical axis)
Vertices of the lattice fall between letters
32
D Y N A M I C P R O G R A M M I N G
The goal is to find the optimal path
from one corner of the lattice (the start of both sequences)
to the opposite corner (the end of both sequences)
33
D Y N A M I C P R O G R A M M I N G
Each path corresponds to a unique alignment
Which one is optimal?
34
D Y N A M I C P R O G R A M M I N G
The score for a path is the sum of its
incremental edge scores
A aligned with A: Match +1
35
D Y N A M I C P R O G R A M M I N G
The score for a path is the sum of its
incremental edge scores
A aligned with T: Mismatch -1
36
D Y N A M I C P R O G R A M M I N G
The score for a path is the sum of its
incremental edge scores
T aligned with NULL: Gap -1
NULL aligned with T: Gap -1
37
D Y N A M I C P R O G R A M M I N G
Incrementally extend the path
[Lattice figure: each vertex is labeled with the best path score reaching it, starting from 0 at the first vertex]
38
D Y N A M I C P R O G R A M M I N G
Incrementally extend the path
Remember the best sub-path leading to each point
on the lattice
39
D Y N A M I C P R O G R A M M I N G
Incrementally extend the path
Remember the best sub-path leading to each point
on the lattice
40
D Y N A M I C P R O G R A M M I N G
Incrementally extend the path
Remember the best sub-path leading to each point
on the lattice
41
D Y N A M I C P R O G R A M M I N G
Incrementally extend the path
Remember the best sub-path leading to each point
on the lattice
42
D Y N A M I C P R O G R A M M I N G
Incrementally extend the path
Remember the best sub-path leading to each point
on the lattice
43
D Y N A M I C P R O G R A M M I N G
Incrementally extend the path
Remember the best sub-path leading to each point
on the lattice
44
D Y N A M I C P R O G R A M M I N G
Trace-back to get optimal path and alignment
45
D Y N A M I C P R O G R A M M I N G
Print out the alignment
G A - T A C T A
G A T T A C C A
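The lattice walk-through can be sketched as a small global-alignment routine. This is a minimal illustration assuming the example's scoring rules (match +1, mismatch -1, gap -1, no separate gap-opening penalty); the function name `needleman_wunsch` is chosen here for convenience and is not taken from the slides.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # score[i][j] = best score of any path reaching vertex (i, j),
    # i.e. of any alignment of a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,   # diagonal edge
                              score[i - 1][j] + gap,       # gap in b
                              score[i][j - 1] + gap)       # gap in a
    # Trace back from the last vertex to the first.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("GATACTA", "GATTACCA")
print(best)
print(row_a)
print(row_b)
```

For the two example sequences the optimal score is 4 (six matches, one mismatch, one gap). Several alignments can tie for that score, so the printed rows may differ from the slide's alignment in where the gap is placed.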
46
Two different types of Alignment
Global Alignment methods
Needleman-Wunsch (J. Mol. Biol. (1970)
48, 443-453). Problem of finding the best path.
Key insight: any partial sub-path that ends at a
point along the true optimal path must itself be
the optimal path leading to that point. This
provides a method to create a matrix of path
scores, each the score of the best path leading to that
point. Trace the optimal path from one end to the
other of the two sequences.
Local Alignment methods
Smith-Waterman (J. Mol. Biol. (1981)
147, 195-197). Use Needleman-Wunsch, but report all
non-overlapping paths, starting at the highest
scoring points in the path graph.
FASTP (Lipman and Pearson (1985), Science 227, 1435-1441)
and BLAST (Altschul et al. (1990), J. Mol. Biol. 215, 403-410)
do not report all paths, but only
attempt to find paths where there are words that are
high-scoring. This speeds up the alignments
considerably.
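A hedged sketch of the local variant: the same matrix as in global alignment, except that scores are never allowed to drop below zero and the trace-back starts at the highest-scoring cell. It reports only the single best local path, not all non-overlapping ones; the scoring values and the function name `smith_waterman` are assumptions for illustration.

```python
def smith_waterman(a, b, match=1, mismatch=-1, gap=-1):
    """Best local alignment; returns (score, aligned_a, aligned_b)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    best, best_i, best_j = 0, 0, 0
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(0,                          # start a new local path
                              score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
            if score[i][j] > best:
                best, best_i, best_j = score[i][j], i, j
    # Trace back from the best cell until the path score falls to zero.
    top, bottom = [], []
    i, j = best_i, best_j
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if a[i - 1] == b[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(top)), "".join(reversed(bottom))


print(smith_waterman("BILLGATESLIKESCHEESE", "GRATEDCHEESE"))
```

With these simple scores the best local path for the earlier example is the shared CHEESE segment.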
47
G L O B A L L O C A L S I M I L A R I T Y
Implementations of dynamic programming for global
and local similarities
48
Score and Statistics
Some amino acid mutations do not affect
structure/function very much. Amino acids with
similar physico-chemical and steric properties
can often replace each other, so use a scoring system
that does not heavily penalize mutations to
similar amino acids.
PAM matrices: Point Accepted
Mutations. Defined in terms of a divergence of 1
percent (1 PAM). For distant sequences use PAM250,
while for closer sequences (like DNA) use PAM100.
Some sites accumulate mutations and others
don't, thus use of the PAM100 matrix doesn't
mean that the sequences compared were 100%
mutated.
BLOSUM: BLOCKS substitution matrices.
Started with the BLOCKS database of multiple
alignments only involving distant sequences.
BLOSUM62 means that the proteins compared were
never closer than 62% identity. BLOSUM50
matrices involved alignment of more distant
sequences. Recommendation: use BLOSUM matrices
(BLOSUM62) for most protein alignments.
49
S C O R I N G S Y S T E M S
A  4
R -1  5
N -2  0  6
D -2 -2  1  6
C  0 -3 -3 -3  9
Q -1  1  0  0 -3  5
E -1  0  0  2 -4  2  5
G  0 -2  0 -1 -3 -2 -2  6
H -2  0  1 -1 -3  0  0 -2  8
I -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S  1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T  0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V  0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
   A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
Some amino acid substitutions are more common
than others
BLOSUM62
Substitution scores come from an odds ratio based
on measured substitution rates
Figure 7.8
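The odds-ratio idea can be illustrated with a short log-odds calculation. The frequencies below are hypothetical, not measured values, and the half-bit scaling (`scale=2.0` with a base-2 logarithm) is the convention commonly described for BLOSUM matrices, stated here as an assumption.

```python
import math


def log_odds_score(q_ij, p_i, p_j, scale=2.0):
    """Rounded log-odds score for a residue pair.

    q_ij : observed frequency of the aligned pair (hypothetical input)
    p_i, p_j : background frequencies of the two residues
    """
    odds = q_ij / (p_i * p_j)          # observed versus expected by chance
    return round(scale * math.log2(odds))


# A pair seen twice as often as chance scores positive,
# a pair seen half as often as chance scores negative.
print(log_odds_score(0.02, 0.1, 0.1))   # 2
print(log_odds_score(0.005, 0.1, 0.1))  # -2
```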
50
S C O R I N G S Y S T E M S
Identities get positive scores, but some are
better than others
BLOSUM62
Figure 7.8
51
S C O R I N G S Y S T E M S
Some non-identities have positive scores, but
most are negative
BLOSUM62
Figure 7.8
52
D A T A B A S E S E A R C H I N G
Compare one query sequence against an entire
database
> fasta myquery swissprot -ktup 2
fasta: search program
myquery: query sequence
swissprot: sequence database
-ktup 2: optional parameters
A typical search has four basic elements
53
D A T A B A S E S E A R C H I N G
With exponential database growth, searches keep
taking more time
> fasta myquery swissprot -ktup 2
searching ......
54
D A T A B A S E S E A R C H I N G
The hit list gives titles and scores for
matched sequences
> fasta myquery swissprot -ktup 2
The best scores are:  initn  init1  opt  z-sc  E(77110)
gi|1706794|sp|P49789|FHIT_HUMAN BIS(5'-ADENOSYL)-  996  996  996  1262.1  0
gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)  412  382  395  507.6  1.4e-21
gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEI  238  133  316  407.4  5.4e-16
gi|3915958|sp|Q58276|Y866_METJA HYPOTHETICAL HIT-  153  98  190  253.1  2.1e-07
gi|3916020|sp|Q11066|YHIT_MYCTU HYPOTHETICAL 15.7  163  163  184  244.8  6.1e-07
gi|3023940|sp|O07513|HIT_BACSU HIT PROTEIN  164  164  170  227.2  5.8e-06
gi|2506515|sp|Q04344|HNT1_YEAST HIT FAMILY PROTEI  130  91  157  210.3  5.1e-05
gi|2495235|sp|P75504|YHIT_MYCPN HYPOTHETICAL 16.1  125  125  148  199.7  0.0002
gi|418447|sp|P32084|YHIT_SYNP7 HYPOTHETICAL 12.4  42  42  140  191.3  0.00058
gi|3025190|sp|P94252|YHIT_BORBU HYPOTHETICAL 15.9  128  73  139  188.7  0.00082
gi|1351828|sp|P47378|YHIT_MYCGE HYPOTHETICAL HIT-  76  76  133  181.0  0.0022
gi|418446|sp|P32083|YHIT_MYCHR HYPOTHETICAL 13.1  27  27  119  165.2  0.017
gi|1708543|sp|P49773|IPK1_HUMAN HINT PROTEIN (PRO  66  66  118  163.0  0.022
gi|2495231|sp|P70349|IPK1_MOUSE HINT PROTEIN (PRO  65  65  116  160.5  0.03
gi|1724020|sp|P49774|YHIT_MYCLE HYPOTHETICAL HIT-  52  52  117  160.3  0.031
gi|1170581|sp|P16436|IPK1_BOVIN HINT PROTEIN (PRO  66  66  115  159.3  0.035
gi|2495232|sp|P80912|IPK1_RABIT HINT PROTEIN (PRO  66  66  112  155.5  0.057
gi|1177047|sp|P42856|ZB14_MAIZE 14 KD ZINC-BINDIN  73  73  112  155.4  0.058
gi|1177046|sp|P42855|ZB14_BRAJU 14 KD ZINC-BINDIN  76  76  110  153.8  0.072
gi|1169825|sp|P31764|GAL7_HAEIN GALACTOSE-1-PHOSP  58  58  104  138.5  0.51
gi|113999|sp|P16550|APA1_YEAST 5',5'''-P-1,P-4-TE  47  47  103  137.8  0.56
gi|1351948|sp|P49348|APA2_KLULA 5',5'''-P-1,P-4-T  63  63  98  131.3  1.3
gi|123331|sp|P23228|HMCS_CHICK HYDROXYMETHYLGLUTA  58  58  99  129.4  1.6
gi|1170899|sp|P06994|MDH_ECOLI MALATE DEHYDROGENA  70  48  91  122.9  3.7
gi|3915666|sp|Q10798|DXR_MYCTU 1-DEOXY-D-XYLULOSE  75  50  92  121.9  4.3
gi|124341|sp|P05113|IL5_HUMAN INTERLEUKIN-5 PRECU  36  36  85  121.3  4.7
gi|1170538|sp|P46685|IL5_CERTO INTERLEUKIN-5 PREC  36  36  84  120.0  5.5
gi|121369|sp|P15124|GLNA_METCA GLUTAMINE SYNTHETA  45  45  90  118.9  6.3
gi|2506868|sp|P33937|NAPA_ECOLI PERIPLASMIC NITRA  48  48  92  117.4  7.6
gi|119377|sp|P10403|ENV1_DROME RETROVIRUS-RELATED  59  59  89  117.0  8
gi|1351041|sp|P48415|SC16_YEAST MULTIDOMAIN VESIC  48  48  97  117.0  8
gi|4033418|sp|O67501|IPYR_AQUAE INORGANIC PYROPHO  38  38  83  116.8  8.3
55
E-value
Hits can be sorted according to their E-value
or their score. The E-value, also known as
the EXPECT value, is a function of the score, the
database size and the query sequence length.
E-value: the number of alignments with a score > S that you
would expect to find if the database were a collection
of random letters. For example, a score of 1
requires only 1 match, so there should be an
enormous number of such alignments. One expects to
find fewer alignments with a score of 5, and so
on. Eventually, when the score is big enough, one
expects to find only an insignificant number of
alignments that could be due to chance. E-values
of less than 1e-6 (1 x 10^-6 in scientific
notation) are usually very good, and for proteins
E < 1e-2 is usually considered significant. It is
still possible for a hit with E > 1 to be
biologically meaningful, but more analysis is
required to confirm that. Even for VERY good
hits, it is possible that the hit is due to a
biological artifact (sequencing/cloning vector,
repeats, low-complexity sequence).
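One widely used functional form for this dependence is the Karlin-Altschul expression E = K * m * n * exp(-lambda * S), where m is the query length and n the database length. The sketch below uses that form; the slide itself does not give a formula, and the default values for `K` and `lam` are placeholders, not parameters of any real scoring system.

```python
import math


def expect_value(score, query_len, db_len, K=0.1, lam=0.3):
    """Expected number of chance alignments scoring at least `score`.

    K and lam are placeholder values chosen for illustration only.
    """
    return K * query_len * db_len * math.exp(-lam * score)


# The expected number of chance hits falls off exponentially with the score
# and grows linearly with database size.
for s in (20, 40, 60):
    print(s, expect_value(s, query_len=150, db_len=30_000_000))
```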
56
E-value
Another type of statistic is the P-value, which,
given a score S for an alignment, is the
probability that an alignment of the query
against a database of random sequences has a
score > S. For gapless alignments the P-value can
be computed from theory. Sometimes one has
alignment algorithms, or biologically complex
databases, that do not allow the computation of a
P-value based on the statistical theory of a
uniform database. In this case, one uses
an alternate statistic, the Z-value (e.g. FASTA
suite), which shuffles the query sequence and
thus creates many compositionally identical query
sequences. Each random sequence is then
re-queried against the database. When done enough
times, this provides a distribution of scores
which is approximately normally distributed (if
lucky) around some mean.
Z-value = (score - mean) / standard deviation
A Z-value of 3 or greater is good.
[Figure: probability distribution of scores, marking the mean, the standard deviation, and the deviation of the alignment score S from the mean]
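The shuffling procedure can be sketched as follows. The helper names are hypothetical, and the scores passed to `z_score` would come from re-running whatever search or alignment program is in use on each shuffled query; that step is not shown here.

```python
import random
import statistics


def shuffled_copies(seq, n, seed=0):
    """Return n shuffled versions of seq with identical composition."""
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    copies = []
    for _ in range(n):
        letters = list(seq)
        rng.shuffle(letters)
        copies.append("".join(letters))
    return copies


def z_score(real_score, shuffled_scores):
    """Distance of the real score from the shuffled mean, in standard deviations."""
    mean = statistics.mean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    return (real_score - mean) / sd


print(shuffled_copies("MSFRFGQHLIKPSVVFLKTE", 3))
print(z_score(10, [4, 5, 6]))  # 5.0
```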
57
D A T A B A S E S E A R C H I N G
Detailed alignments are shown farther down in the
output
> fasta myquery swissprot -ktup 2
>>gi|1703339|sp|P49776|APH1_SCHPO BIS(5'-NUCLEOSYL)-TETR (182 aa)
 initn: 412 init1: 382 opt: 395 z-score: 507.6 E(): 1.4e-21
Smith-Waterman score: 395; 52.3% identity in 109 aa overlap

gi|170   MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRPVERFHDLRPDEVADLF
gi|170  MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLVIPQRAVPRLKDLTPSELTDLF

gi|170  QTTQRVGTVVEKHFHGTSLTFSMQDGPEAGQTVKHVHVHVLPRKAGDFHRNDSIYEELQK
gi|170  TSVRKVQQVIEKVFSASASNIGIQDGVDAGQTVPHVHVHIIPRKKADFSENDLVYSELEK

gi|170  HDKEDFPASWRSEEEMAAEAAALRVYFQ
gi|170  NEGNLASLYLTGNERYAGDERPPTSMRQAIPKDEDRKPRTLEEMEKEAQWLKGYFSEEQE

>>gi|1723425|sp|P49775|HNT2_YEAST HIT FAMILY PROTEIN 2 (217 aa)
 initn: 238 init1: 133 opt: 316 z-score: 407.4 E(): 5.4e-16
Smith-Waterman score: 316; 37.4% identity in 131 aa overlap

gi|170  MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLVCPLRP-VER
58
H A S H I N G M E T H O D S
The simplest database search is one large
dynamic programming problem. For a query of N
letters against a database of M letters, it
requires M x N comparisons.
Query sequence
Database sequence
59
H A S H I N G M E T H O D S
Hashing is a common method for accelerating
database searches
MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE
(all overlapping words of size 3)
Compile a dictionary of words from the query
sequence. Put each word in a look-up table that
points to the original position in the sequence.
Thus, given one word, you can tell whether it is in the
query in a single operation.
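A minimal sketch of such a dictionary, assuming the query is the sequence spelled out by the overlapping words on the slide (MLIIKRDELVISWASHERE). The function name `build_word_index` is illustrative, and positions are 0-based here.

```python
from collections import defaultdict


def build_word_index(query, k=3):
    """Map every overlapping word of length k to its positions in the query."""
    index = defaultdict(list)
    for pos in range(len(query) - k + 1):
        index[query[pos:pos + k]].append(pos)  # 0-based start position
    return index


index = build_word_index("MLIIKRDELVISWASHERE")
print(len(index))        # 17 distinct words
print(index["MLI"])      # [0]
print("XYZ" in index)    # False: a single lookup answers the question
```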
60
Index lookup
  • Each word is assigned a unique integer.
  • E.g. for a word of 3 letters made up of an
    alphabet of 20 letters:
  • 1. Assign a code to each letter: Code(l) (0 to 19)
  • 2. For a word of 3 letters L1 L2 L3 the code is
    index = Code(L1)*20^2 + Code(L2)*20^1 + Code(L3)
  • 3. Have an array with a list of the positions
    that have that word.

[Figure: array indexed by word (AAA, AAB, ..., MLI, MLJ, ...; indices 0, 1, 2, 3, ...); the entry for MLI holds 1, the position in the query sequence of that word.]
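The index formula can be sketched directly. The letter order in `ALPHABET` is an assumption (any fixed ordering of the 20 amino acids works), so the particular integers depend on that choice.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 amino acids
CODE = {letter: i for i, letter in enumerate(ALPHABET)}


def word_to_index(word):
    """Treat the word as a base-20 number: Code(L1)*20^2 + Code(L2)*20 + Code(L3)."""
    index = 0
    for letter in word:
        index = index * len(ALPHABET) + CODE[letter]
    return index


print(word_to_index("AAA"))  # 0
print(word_to_index("AAC"))  # 1
print(word_to_index("MLI"))  # 10*400 + 9*20 + 7 = 4187
print(word_to_index("YYY"))  # 7999, the last of 20**3 = 8000 slots
```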
61
H A S H I N G M E T H O D S
Building the dictionary for the query sequence
requires (N-2) operations.
MLI LII IIK IKR KRD RDE DEL ELV LVI VIS ISW SWA WAS ASH SHE HER ERE
(all overlapping words of size 3)
The database contains (M-2) words, and it takes
only one operation to see if the word was in the
query.
62
H A S H I N G M E T H O D S
Query sequence
Scan the database, looking up words in the
dictionary
Use word hits to determine where to search for
alignments. This takes (N-2) + (M-2) operations, instead of
the M x N needed to fill the dynamic programming matrix.
Database sequence
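The scan can be sketched as below. Each hit is reported with its diagonal (database position minus query position), since runs of hits on one diagonal are where an alignment is worth searching for. The database string in the example is made up for illustration, and `word_hits` is a hypothetical helper name.

```python
from collections import defaultdict


def word_hits(query, database, k=3):
    """Return (query_pos, db_pos, diagonal) for every shared word of length k."""
    index = defaultdict(list)
    for q in range(len(query) - k + 1):          # N-2 operations for k = 3
        index[query[q:q + k]].append(q)
    hits = []
    for d in range(len(database) - k + 1):       # M-2 lookups for k = 3
        for q in index.get(database[d:d + k], ()):
            hits.append((q, d, d - q))
    return hits


# Hypothetical database fragment sharing the segment KRDELV with the query.
for hit in word_hits("MLIIKRDELVISWASHERE", "AAKRDELVAA"):
    print(hit)
# (4, 2, -2)
# (5, 3, -2)
# (6, 4, -2)
# (7, 5, -2)
```

All four hits fall on the same diagonal, which is the kind of signal FASTA and BLAST use to decide where to extend.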
63
H A S H I N G M E T H O D S
Query sequence
Scan the database, looking up words in the
dictionary
Use word hits to determine where to search for
alignments
Database sequence
FASTA searches in a band
64
H A S H I N G M E T H O D S
Query sequence
Scan the database, looking up words in the
dictionary
Use word hits to determine where to search for
alignments
Database sequence
BLAST extends from word hits
65
Multiple Alignment
FHIT_HUMAN  MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLV...
APH1_SCHPO  MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
Y866_METJA  MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...

FHIT_HUMAN  -----------MS-F RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO  -----------MPKQ LYFSKFPVG-SQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKP IYFSKFLVT-EQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA  -----------MCIF CKIINGEIP-AKVVY EDEHVLAFLDINPRN KGHTLV...
A true multiple alignment method will align all
the sequences together at the same time.
66
Multiple Alignment
FHIT_HUMAN  -----------MS-F RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO  -----------MPKQ LYFSKFPVG-SQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKP IYFSKFLVT-EQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA  -----------MCIF CKIINGEIP-AKVVY EDEHVLAFLDINPRN KGHTLV...
A true multiple alignment method will align all
the sequences together at the same time.
Unfortunately, there is no formal computationally
tractable method for more than 3 sequences.
There are many approximate methods, such as
Progressive multiple alignment methods.
67
Progressive Multiple Alignment
Align all pairs of sequences.
Pairwise alignments compute distance matrix
            FHIT_HUMAN  APH1_SCHPO  HNT2_YEAST  Y866_METJA
FHIT_HUMAN
APH1_SCHPO  395
HNT2_YEAST  316         380
Y866_METJA  290         300         340
68
Progressive Multiple Alignment
Guide Tree (leaves, top to bottom): FHIT_HUMAN, APH1_SCHPO, HNT2_YEAST, Y866_METJA
Pairwise alignments compute distance matrix
            FHIT_HUMAN  APH1_SCHPO  HNT2_YEAST  Y866_METJA
FHIT_HUMAN
APH1_SCHPO  395
HNT2_YEAST  316         380
Y866_METJA  290         300         340
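One simple way to turn the pairwise matrix into an alignment order is sketched below. It assumes the numbers are similarity scores (higher means closer), starts from the best-scoring pair, and then repeatedly adds the sequence with the highest average score to those already joined. This greedy rule is a stand-in for illustration; real progressive aligners build the guide tree with a proper clustering method.

```python
from itertools import combinations


def join_order(names, scores):
    """Greedy joining order from a dict of pairwise similarity scores."""
    def s(a, b):
        return scores[(a, b)] if (a, b) in scores else scores[(b, a)]

    # Start with the closest (highest-scoring) pair.
    order = list(max(combinations(names, 2), key=lambda pair: s(*pair)))
    remaining = [n for n in names if n not in order]
    while remaining:
        # Next: the sequence with the highest average score to the joined set.
        nxt = max(remaining, key=lambda n: sum(s(n, m) for m in order) / len(order))
        order.append(nxt)
        remaining.remove(nxt)
    return order


names = ["FHIT_HUMAN", "APH1_SCHPO", "HNT2_YEAST", "Y866_METJA"]
scores = {
    ("APH1_SCHPO", "FHIT_HUMAN"): 395,
    ("HNT2_YEAST", "FHIT_HUMAN"): 316,
    ("HNT2_YEAST", "APH1_SCHPO"): 380,
    ("Y866_METJA", "FHIT_HUMAN"): 290,
    ("Y866_METJA", "APH1_SCHPO"): 300,
    ("Y866_METJA", "HNT2_YEAST"): 340,
}
print(join_order(names, scores))
# ['FHIT_HUMAN', 'APH1_SCHPO', 'HNT2_YEAST', 'Y866_METJA']
```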
69
Multiple Alignment
FHIT_HUMAN  MSFRFGQHLIKPSVVFLKTELSFALVNRKPVVPGHVLV...
APH1_SCHPO  MPKQLYFSKFPVGSQVFYRTKLSAAFVNLKPILPGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
Y866_METJA  MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...

FHIT_HUMAN  MSFR FGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO  MPKQ LYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
Y866_METJA  MCIF CKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
Align two closest sequences
This alignment creates a consensus sequence that
is next used to align subsequent sequences.
70
Multiple Alignment
FHIT_HUMAN  MS-F RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO  MPKQ LYFSKFPVG-SQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKPIYFSKFLVTEQVFYKSKYTYALVNLKPIVPGHVLI...
Y866_METJA  MCIFCKIINGEIP-AKVVYEDEHVLAFLDINPRNKGHTLV...

FHIT_HUMAN  -----------MSF RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO  -----------MPK QLYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST  MILSKTKKPKSMNK PIYFSKFLVTEQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA  MCIF CKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...
Align Next closest sequence to the consensus.
71
Multiple Alignment
FHIT_HUMAN  -----------MS-F RFGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO  -----------MPKQ LYFSKFPVG-SQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKP IYFSKFLVT-EQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA  MCIFCKIINGEIPAKVVYEDEHVLAFLDINPRNKGHTLV...

FHIT_HUMAN  -----------MSFR FGQHLIKP-SVVFL KTELSFALVNRKPVV PGHVLV...
APH1_SCHPO  -----------MPKQ LYFSKFPVGSQVFY RTKLSAAFVNLKPIL PGHVLV...
HNT2_YEAST  MILSKTKKPKSMNKP IYFSKFLVTEQVFY KSKYTYALVNLKPIV PGHVLI...
Y866_METJA  -----------MCIF CKIINGEIPAKVVY EDEHVLAFLDINPRN KGHTLV...
Align Next closest sequence to new consensus.
Hopefully, the result should be similar to what a
true multiple alignment method would have
yielded. We saw that the order of alignment
determines the existence of gaps.
72
CLUSTALW
Clustalw is a progressive multiple
alignment tool.
- Adaptive gap opening and extension scores make
  it relatively insensitive to small changes in
  gap parameters.
- Choice of DNA or protein gap penalty
  alignments.
- Available on the web or on PC/Mac/unix.
  http://dot.imgen.bcm.tmc.edu:9331/multi-align/Options/clustalw.html
  The uppercase O in Options is relevant.
73
BLAST and BLAST2SEQUENCES
BLAST is a database search engine
based on using hashing to accelerate the
search.
- blastn (for nucleotides) or blastp (for proteins)
- blastx (translates a nucleotide query in all 6
  reading frames and compares it to a protein database)
- tblastn (compares a protein against a nucleotide
  database translated in all 6 reading frames)
- tblastx (compares a nucleotide sequence against a
  nucleotide database by translating the query
  and database in all 6 reading frames)
http://www.ncbi.nlm.nih.gov/BLAST/
A pairwise alignment implementation of
these programs is available at
http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html
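The "all 6 reading frames" step used by blastx, tblastn and tblastx can be sketched as follows. This is only an illustration using the standard genetic code and plain A/C/G/T input; it is not how BLAST itself is implemented, and the function names are chosen here for convenience.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order; "*" marks a stop.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def translate(dna):
    """Translate complete codons from the start of dna; trailing bases are ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]


print(six_frames("ATGGCC"))  # ['MA', 'W', 'G', 'GH', 'A', 'P']
```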
74
Query-Anchored Alignments (Master-Slave)
Clustalw
Is a multiple alignment program. Every sequence
is aligned to every other one.
Blast
NOT a multiple alignment program, but may display
query-anchored multiple pairwise alignments that
look like a multiple alignment; all the
sequences are only aligned to the first sequence!
A gap in the query means NOTHING can be aligned
to it. Gaps may optionally be shown (flat view),
or the entire column omitted.
[Figure labels: "This column is NOT aligned together. It is displayed there for convenience." and "Gap in subject sequence"]
75
BLAST and BLAST2SEQUENCES
Exercises: Use Entrez to find the protein
sequences with LOCUS names FHIT_HUMAN and HNT2_YEAST.
Use clustalw to align these two sequences and,
WITHOUT LOSING THAT RESULT SCREEN, use pairwise
blast to align these two sequences as
well.
Exercise: Try to reproduce the example
clustalW alignment (the order of input sequences
is not important).
76
References
Textbook: "Bioinformatics: A Practical
Guide to the Analysis of Genes and Proteins",
edited by Andy D. Baxevanis and B.F.
Ouellette. Readings: chapters 7, 8, 9.
http://www.ncbi.nlm.nih.gov/BLAST/blast_overview.html