GLOBAL PAIRWISE ALIGNMENT GLOBAL ALIGNMENT OF: 2 NUCLEOTIDE SEQUENCES OR 2 AMINO-ACID SEQUENCES - PowerPoint PPT Presentation

Slides: 129
Provided by: bio1171
Learn more at: http://nsmn1.uh.edu
Transcript and Presenter's Notes

1
GLOBAL PAIRWISE ALIGNMENT GLOBAL ALIGNMENT
OF 2 NUCLEOTIDE SEQUENCES OR 2 AMINO-ACID
SEQUENCES
2
Assumptions: Life is monophyletic. Biological
entities (sequences, taxa) share common ancestry.
3
Any two organisms share a common ancestor in
their past
4
(5 MYA)
5
(120 MYA)
ancestor
6
(1,500 MYA)
ancestor
7
(1) Speciation events (2) Gene duplication (3)
Duplicative transposition
Homologous sequences
8
Homology: A term coined by Richard Owen in 1843.
Definition: Similarity resulting from common
ancestry.
9
Homology
  • There are three main types of molecular homology:
    orthology, paralogy (including ohnology) and
    xenology.

10
Homology General Definition
  • Homology designates a qualitative relationship of
    common descent between entities
  • Two genes are either homologous or they are not!
  • It doesn't make sense to say two genes are 43%
    homologous.
  • It doesn't make sense to say Linda is 43%
    pregnant.

11
Orthology and Paralogy
  • Two genes are orthologs if they originated from a
    single ancestral gene in the most recent common
    ancestor of their respective genomes
  • Two genes are paralogs if they are related by
    gene duplication. Two genes are ohnologs if they
    are related by gene duplication due to genome
    duplication

12
(No Transcript)
13
Gene death
14
Xenology is due to horizontal (lateral) gene
transfer (HGT or LGT)
XA and XB are xenologs
Distinguishing orthologs from xenologs is
impossible in pairwise genomic comparisons, but
possible when multiple genomes are compared
15
Orthology, Paralogy, Xenology (Fitch, Trends in
Genetics, 2000, 16(5):227-231)
16
Homology
By comparing homologous characters, we can
reconstruct the evolutionary events that have led
to the formation of the extant sequences from the
common ancestor.
17
Homology
When comparing sequences, we are interested in
POSITIONAL HOMOLOGY. We identify POSITIONAL
HOMOLOGY through SEQUENCE ALIGNMENT.
18
Positional homology: In pairwise alignment, a
pair of nucleotides from two homologous sequences
that have descended from one nucleotide in the
ancestor of the two sequences.
Alignment: A hypothesis concerning positional
homology among residues from two or more sequences.
19
Sequence alignment involves the identification of
the correct location of deletions and insertions
that have occurred in either of the two lineages
since their divergence from a common ancestor.
20
(No Transcript)
21
Unknown sequence
Unknown events unknown sequence of events
Unknown events unknown sequence of events
The true alignment is unknown.
22
There are two modes of alignment.
Global alignment: each residue of sequence A is compared
with each residue in sequence B. Global alignment
algorithms are used in comparative and
evolutionary studies.
Local alignment: determining if sub-segments of one
sequence are present in another. Local alignment methods
have their greatest utility in database searching and
retrieval (e.g., BLAST).
23
For reasons of computational complexity, sequence
alignment is divided into two categories:
Pairwise alignment (i.e., the alignment of two
sequences).
Multiple-sequence alignment (i.e.,
the alignment of three or more sequences).
Pairwise alignment problems have exact
solutions. Multiple-sequence alignment problems
only have approximate (heuristic) solutions.
24
A pairwise alignment consists of a series of
paired bases, one base from each sequence. There
are three types of pairs:
(1) matches: the same nucleotide appears in both sequences.
(2) mismatches: different nucleotides are found in
the two sequences.
(3) gaps: a base in one sequence and a null base in the other.
GCGGCCCATCAGGTAGTTGGTG-G
GCGTTCCATC--CTGGTTGGTGTG
25
  • Two DNA sequences A and B.
  • Lengths are m and n, respectively.
  • The number of matched pairs is x.
  • The number of mismatched pairs is y.
  • Total number of bases in gaps is z.
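The quantities x, y and z can be tallied directly from an aligned pair. The following is a minimal sketch, not part of the original slides; the function name `classify_pairs` is hypothetical, and the input is the alignment shown on slide 24. It also checks the bookkeeping identity m + n = 2(x + y) + z, which holds because every matched or mismatched pair uses one base from each sequence while every gap position uses exactly one base.

```python
def classify_pairs(aligned_a, aligned_b, gap="-"):
    """Return (x, y, z): matched pairs, mismatched pairs, bases opposite a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    x = y = z = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            z += 1  # a base in one sequence and a null base in the other
        elif a == b:
            x += 1  # match
        else:
            y += 1  # mismatch
    return x, y, z


# Alignment from slide 24.
a = "GCGGCCCATCAGGTAGTTGGTG-G"
b = "GCGTTCCATC--CTGGTTGGTGTG"
x, y, z = classify_pairs(a, b)

# Unaligned lengths m and n are recovered by dropping the gap characters.
m = len(a.replace("-", ""))
n = len(b.replace("-", ""))
assert m + n == 2 * (x + y) + z
print(x, y, z)
```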
26
There are internal and terminal gaps.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
27
A terminal gap may indicate missing data.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
28
An internal gap indicates that a deletion or an
insertion has occurred in one of the two
lineages.
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
29
When sequences are compared through alignment, it
is impossible to tell whether a deletion has
occurred in one sequence or an insertion has
occurred in the other. Thus, deletions and
insertions are collectively referred to as indels
(short for insertion or deletion).
GCGG-CCATCAGGTAGTTGGTG--
GCGTTCCATC--CTGGTTGGTGTG
30
The alignment is the first step in many
functional and evolutionary studies. Errors in
alignment tend to amplify in later stages of the
study.
31
Motivation for sequence alignment
  • Function
  • Similarity may be indicative of similar function.
  • Evolution
  • Similarity may be indicative of common ancestry.

32
Some definitions
33
An example of pairwise alignment of an unknown
protein with a known one
  • (A) Glutaredoxin, Bacteriophage T4 from E. coli, 87
    aa
  • (B) Unknown protein, Bacteriophage 65 from Aeromonas
    sp., 93 aa

Glutar  KVYGYDSNIHKCVYCDNAKRLLTVKKQPFEFINIMPEKGV---FDDEKIAELLTKLGR
Unknow  EIYGIPEDVAKCSGCISAIRLCFEKGYDYEIIPVLKKANNQLGFDYILEKFDECKARANM

Glutar  DTQIGLTMPQVFAPDGSHIGGFDQLREYF
Unknow  QTR-PTSFPRIFV-DGQYIGSLKQFKDLY
Is the unknown protein a glutaredoxin?
34
Methods of alignment
1. Manual
2. Dot matrix
3. Distance Matrix
4. Combined (Distance + Manual)
35
  • Manual alignment. When there are few gaps and the
    two sequences are not too different from each
    other, a reasonable alignment can be obtained by
    visual inspection.

GCG-TCCATCAGGTAGTTGGTGTG GCGATCCATCAGGTGGTTGGTGTG
36
Advantages of manual alignment:
(1) use of a powerful and trainable tool (the brain, well,
some brains).
(2) ability to integrate additional data, e.g., domain
structure, biological function.
37
(No Transcript)
38
Protein Alignment may be guided by Secondary and
Tertiary Structures
Escherichia coli DjlA protein
Homo sapiens DjlA protein
39
Disadvantages of manual alignment
  • subjectivity (the algorithm is unspecified)
  • irreproducibility (the results cannot be
    independently reproduced)
  • unscalability (inapplicable to long sequences)
  • incommensurability (the results cannot be compared
    to those obtained by other methods)
40
The dot-matrix method (Gibbs and McIntyre, 1970):
The two sequences are written out as column and
row headings of a two-dimensional matrix. A dot
is put in the dot-matrix plot at a position where
the nucleotides in the two sequences are
identical.
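The construction described above can be sketched in a few lines. This is an illustration only, not code from the presentation; the function names `dot_matrix` and `render` and the two short sequences are made up for the example.

```python
def dot_matrix(seq_rows, seq_cols):
    """Boolean matrix: True wherever the row residue equals the column residue."""
    return [[r == c for c in seq_cols] for r in seq_rows]


def render(seq_rows, seq_cols):
    """Text rendering with seq_cols as the column headings and seq_rows as row headings."""
    lines = ["  " + " ".join(seq_cols)]
    for residue, row in zip(seq_rows, dot_matrix(seq_rows, seq_cols)):
        lines.append(residue + " " + " ".join("." if hit else " " for hit in row))
    return "\n".join(lines)


# Hypothetical toy sequences, chosen only to keep the plot small.
print(render("GATTC", "GATC"))
```

A run of dots along a diagonal corresponds to a stretch of consecutive identical residues in the two sequences.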
41
The alignment is defined by a path from the
upper-left element to the lower-right element.
42
There are 4 possible steps in the path:
  • (1) a diagonal step through a dot: match.
  • (2) a diagonal step through an empty element of
    the matrix: mismatch.
  • (3) a horizontal step: a gap in the sequence on
    the left of the matrix.
  • (4) a vertical step: a gap in the sequence on
    the top of the matrix.

43
A dot matrix may become cluttered. With DNA
sequences, 25% of the elements will be occupied
by dots by chance alone.
44
window size = 1, stringency = 1, alphabet size = 4
The number of spurious matches is determined by
window size (how many residues are compared),
stringency (the minimum number of matches for a
hit), and alphabet size (number of character
states). Window size must be an odd number.
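As a rough illustration of how window size and stringency filter a dot plot, the sketch below centers a window on each pair of positions and records a dot only when enough residues inside the window agree. The function name `windowed_dots` and the choice to skip positions too close to the sequence ends are assumptions made for this example, not details taken from the slides.

```python
def windowed_dots(seq_a, seq_b, window=3, stringency=2):
    """Return (i, j) centers whose windows share at least `stringency` identical residues."""
    if window % 2 == 0:
        raise ValueError("window size must be odd so that it has a central residue")
    half = window // 2
    hits = []
    # Assumption: windows that would run off either end of a sequence are skipped.
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            matches = sum(
                seq_a[i + k] == seq_b[j + k] for k in range(-half, half + 1)
            )
            if matches >= stringency:
                hits.append((i, j))
    return hits


# window = 1 and stringency = 1 reduce to the plain dot matrix.
print(windowed_dots("ACGT", "ACGT", window=1, stringency=1))
# A wider window with a higher stringency keeps only the diagonal signal.
print(windowed_dots("ACGT", "ACGT", window=3, stringency=3))
```

Raising the stringency relative to the window size removes isolated chance matches at the cost of also hiding short or weakly conserved regions.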
45
window size = 1, stringency = 1, alphabet size = 4
window size = 3, stringency = 2, alphabet size = 4
46
window size = 1, stringency = 1, alphabet size = 20
47
Dot-matrix methods
Advantages: By being a
visual representation, and humans being visual
animals, the method may unravel information on
the evolution of sequences that cannot easily be
gleaned from a line alignment.
Disadvantages: May not identify the best possible alignment.
48
Window size = 60 amino acids, stringency = 24
matches
Advantages: Highlighting Information
49
Window size = 60 amino acids, stringency = 24
matches
Advantages: Highlighting Information
The two pairs of diagonally oriented parallel
lines most probably indicate that two small
internal duplications occurred in the bacterial
gene.
50
Disadvantages: Not possible to identify the
best alignment.
51
Scoring Matrices and Gap Penalties
52
The true alignment between two sequences is the
one that reflects accurately the evolutionary
relationships between the sequences. Since the
true alignment is unknown, in practice we look
for the optimal alignment, which is the one in
which the numbers of mismatches and gaps are
minimized according to certain criteria.
53
Unfortunately, reducing the number of mismatches
results in an increase in the number of gaps, and
vice versa.
54
a = matches, b = mismatches, g = nucleotides in
gaps, d = gaps
55
The scoring scheme comprises a gap penalty and a
scoring matrix, M(a,b), that specifies the score
for each type of match (a = b) or mismatch (a ≠
b). The units in a scoring matrix may be the
nucleotides in the DNA or RNA sequences, the
codons in protein-coding regions, or the amino
acids in protein sequences.
56
DNA scoring matrices are usually simple. In the
simplest scheme all mismatches are given the same
penalty: M(a,b) is positive if a = b and
negative otherwise. In more complicated
matrices a distinction may be made between
transition and transversion mismatches, or each
type of mismatch may be penalized differently.
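The two kinds of DNA scoring scheme described above might be written as follows. This is a sketch under stated assumptions: the numeric scores (1, -1, -2) are illustrative placeholders, not values from the presentation, and the function names are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def simple_score(a, b, match=1, mismatch=-1):
    """Simplest scheme: one positive score for a = b, one penalty for every mismatch."""
    return match if a == b else mismatch


def ts_tv_score(a, b, match=1, transition=-1, transversion=-2):
    """Scheme that penalizes transversions more heavily than transitions.

    A transition swaps two purines (A, G) or two pyrimidines (C, T);
    any other mismatch is a transversion.
    """
    if a == b:
        return match
    same_class = {a, b} <= PURINES or {a, b} <= PYRIMIDINES
    return transition if same_class else transversion


for pair in [("A", "A"), ("A", "G"), ("A", "C")]:
    print(pair, simple_score(*pair), ts_tv_score(*pair))
```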
57
Further complications: Distinguishing among
different matches and mismatches. For example, a
mismatched pair consisting of Leu and Ile, which
are very similar biochemically to each other, may
be given a lesser penalty than a mismatched pair
consisting of Arg and Glu, which are very
dissimilar from each other.
58
Lesser penalty than
59
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
60
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
B = Asx (Asp or Asn); X = unknown; Z = Glx (Glu or
Gln); * = termination codon
61
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
The matrix is symmetrical
62
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
Positive numbers on the diagonal
63
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
Mismatches are usually penalized
64
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
Some mismatches are not penalized
65
BLOSUM62 (BLOcks of amino acid SUbstitution Matrix)
A few mismatches are even rewarded
66
Gap penalty (or cost) is a factor (or a set of
factors) by which the gap values (numbers and
lengths of gaps) are mathematically manipulated
to make the gaps equivalent in value to the
mismatches. The gap penalties are based on our
assessment of how frequently different types of
insertions and deletions occur in evolution in
comparison with the frequency of occurrence of
point substitutions.
67
Mismatches
Gaps
68
The gap penalty has two components: a gap-opening
penalty and a gap-extension penalty.
69
Three main gap-penalty systems: (1) Fixed
gap-penalty system: zero gap-extension costs.
70
Three main gap-penalty systems: (2) Linear
gap-penalty system: the gap-extension cost is
calculated by multiplying the gap length minus 1
by a constant representing the gap-extension
penalty for increasing the gap by 1.
71
Three main gap-penalty systems: (3)
Logarithmic gap-penalty system: the
gap-extension penalty increases with the
logarithm of the gap length, i.e., more slowly.
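The three systems can be compared side by side as functions of gap length. The sketch below is illustrative only: the default opening and extension values are arbitrary, the function names are hypothetical, and the exact form used for the logarithmic system (opening plus extension times the natural logarithm of the length) is an assumption, since the slides give only the qualitative behaviour.

```python
import math


def fixed_gap(length, opening=4.0):
    """Fixed system: a gap costs the opening penalty regardless of its length."""
    return opening


def linear_gap(length, opening=4.0, extension=1.0):
    """Linear system: opening penalty plus (length - 1) times the extension penalty."""
    return opening + (length - 1) * extension


def logarithmic_gap(length, opening=4.0, extension=1.0):
    """Logarithmic system (assumed form): the extension cost grows with ln(length)."""
    return opening + extension * math.log(length)


for k in (1, 2, 10, 100):
    print(k, fixed_gap(k), linear_gap(k), round(logarithmic_gap(k), 2))
```

All three agree for a gap of length 1; they differ in how strongly they discourage long gaps.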
72
Alignment algorithms
73
Aim: Given a predetermined set of criteria, find
the alignment associated with the best score from
among all possible alignments: the OPTIMAL
ALIGNMENT.
74
The number of possible alignments may be
astronomical.
where n and m are the lengths of the two
sequences to be aligned.
75
The number of possible alignments may be
astronomical. For example, when two DNA
sequences 200 residues long each are compared,
there are more than 10^153 possible alignments.
In comparison, the number of protons in the
universe is only 10^80.
76
FORTUNATELY: There are computer algorithms for
finding the optimal alignment between two
sequences that do not require an exhaustive
search of all the possibilities.
77
The Needleman-Wunsch (1970) algorithm uses
Dynamic Programming
78
Dynamic programming: a computational technique.
It is applicable when large searches can be
divided into a succession of small stages, such
that (1) the solution of the initial search stage
is trivial, (2) each partial solution in a later
stage can be calculated by reference to only a
small number of solutions in an earlier stage,
and (3) the last stage contains the overall
solution.
79
Dynamic programming can be applied to problems of
alignment because ALIGNMENT SCORES obey the
following rules
80
Path Graph for aligning two sequences
81
allowed
82
not allowed
83
(No Transcript)
84
Scoring scheme: match = 5, mismatch = -3,
gap-opening penalty = -4, gap-extension penalty = 0
85
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
Matrix initialization
86
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
Matrix initialization
0 + match = 5
87
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
Matrix initialization
0 + gap = -4
88
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
Matrix initialization
0 + gap = -4
89
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
Matrix fill
0 + match = 5
90
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
Matrix fill
5 + gap = 1
91
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
Matrix fill
0 + gap = -4
92
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
and so on and so forth
93
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
Complete matrix fill
94
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
Trace back
95
The alignment is produced by starting either at
the highest score in the rightmost column or the
bottom row (when terminal gaps are allowed) or at
the bottom-right cell (when they are not), and
proceeding from right to left by following the
best pointers. This stage is called the
traceback. The graph of pointers in the traceback
is also referred to as the path graph because it
defines the paths through the matrix that
correspond to the optimal alignment or
alignments.
96
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
Trace back (if we DO allow terminal gaps)
97
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
Trace back (if we DO NOT allow terminal gaps)
98
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
10 + gap ≠ 11
14 + mismatch = 11
10 + gap ≠ 11
Trace back (if we DO NOT allow terminal gaps)
99
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
10 + gap ≠ 14
9 + match = 14
5 + gap ≠ 14
Trace back (if we DO NOT allow terminal gaps)
100
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
4 + mismatch ≠ 9
13 + gap = 9
0 + gap ≠ 9
Trace back (if we DO NOT allow terminal gaps)
101
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
8 + match = 13
4 + gap ≠ 13
9 + gap ≠ 13
Trace back (if we DO NOT allow terminal gaps)
102
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
1 + gap ≠ 8
12 + gap = 8
3 + match = 8
Trace back (if we DO NOT allow terminal gaps)
103
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
7 + gap ≠ 12
7 + match = 12
3 + gap ≠ 12
7 + gap = 3
6 + gap ≠ 3
2 + mismatch ≠ 3
Trace back (if we DO NOT allow terminal gaps)
104
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0

Trace back (if we DO NOT allow terminal gaps)
105
match = 5, mismatch = -3, gap-opening penalty =
-4, gap-extension penalty = 0
high road/low road/middle road
Trace back (complete)
106
Two possible alignments:

GAATTCAGT
GGA-TC-GA

GAATTCAGT
GGAT-C-GA
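The initialization, matrix fill and traceback stages of the worked example can be sketched as a small Needleman-Wunsch implementation. This is a minimal illustration, not the presenter's code. It assumes that every gapped position scores -4 (which agrees with the single-residue gaps in the example, but does not model a separate gap-extension value), that terminal gaps are penalized, and that traceback starts at the bottom-right cell. When several pointers tie, it follows the diagonal first, so it returns only one of the co-optimal alignments.

```python
def needleman_wunsch(a, b, match=5, mismatch=-3, gap=-4):
    """Global alignment by dynamic programming; returns (score, aligned_a, aligned_b)."""

    def pair_score(x, y):
        return match if x == y else mismatch

    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: the first row and column hold runs of terminal gaps.
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    # Fill: each cell depends only on its three already-computed neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]),  # diagonal
                score[i - 1][j] + gap,  # base of a opposite a gap
                score[i][j - 1] + gap,  # base of b opposite a gap
            )

    # Traceback from the bottom-right cell, preferring the diagonal on ties.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = needleman_wunsch("GAATTCAGT", "GGATCGA")
print(best)
print(row_a)
print(row_b)
```

Enumerating every co-optimal path, as the path graph on the slides does, would require storing all tied pointers for each cell instead of following only the first one.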
107
Scoring Matrices
Mismatch and gap penalties should be inversely
proportional to the frequencies with which
changes occur.
108
Transitions (68%) occur more frequently than
transversions (32%). Mismatch penalties for
transitions should be smaller than those for
transversions.
109
Empirical substitution matrices
PAM (Percent/Point Accepted Mutation)
BLOSUM (BLOcks SUbstitution Matrix)
110
PAM
  • Developed by Margaret Dayhoff in 1978.
  • Based on comparisons of very similar protein
    sequences.

111
Log-odds ratios
  • A scoring matrix is a table of values that
    describes the probability of a residue (amino acid
    or base) pair occurring in an alignment.
  • The values in a scoring matrix are log ratios of
    two probabilities.
  • One is the random probability. The other
    is the probability of an empirical pair
    occurrence.
  • Because the scores are logarithms of probability
    ratios, they can be added to give a meaningful
    score for the entire alignment. The more
    positive the score, the better the alignment!

112
The PAM matrices (Percent accepted mutations)
  • Align sequences that are at least 85% identical.
  • Minimizes ambiguity in alignments and the number
    of coincident mutations.
  • Reconstruct phylogenetic trees and infer
    ancestral sequences.
  • Tally replacements "accepted" by natural
    selection, in all pairwise comparisons.
  • Meaning, the number of times j was replaced by i
    in all comparisons.
  • Compute amino acid mutability (i.e., the
    propensity of a given amino acid, j, to be
    replaced).

113
The PAM matrices
  • Combine data to produce a Mutation Probability
    Matrix for one PAM of evolutionary distance,
    which is used to calculate the Log Odds Matrix
    for similarity scoring.
  • Thus, depending on the protein family used,
    various PAM matrices result: some are good at
    locating evolutionarily distant conserved
    mutations and some are good at locating
    evolutionarily close conserved mutations.

114
More on log-odds ratios
In PAM, log-odds scores are multiplied by 10 to
avoid decimals. Therefore, a PAM score of 2
actually corresponds to a log-odds ratio of 0.2.
For the substitution of i by j:
0.2 = log10[(observed i-j mutation rate) / (expected rate)]
The value 0.2 is log10 of the relative expectation
value of the mutation. Therefore, the expectation
value is 10^0.2 ≈ 1.6. So, a PAM score of 2 indicates
that (in related sequences) the mutation would be
expected to occur 1.6 times more frequently than
random.
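The arithmetic on this slide can be checked with a short sketch. The helper names `pam_score` and `odds_from_score` are hypothetical; the factor of 10 and the base-10 logarithm follow the description above, and rounding to the nearest integer is an assumption about how decimals are avoided.

```python
import math


def pam_score(observed_rate, expected_rate, scale=10):
    """Scaled log-odds score: scale * log10(observed / expected), rounded to an integer."""
    return round(scale * math.log10(observed_rate / expected_rate))


def odds_from_score(score, scale=10):
    """Invert the scaling: how many times more frequent than random a pair is."""
    return 10 ** (score / scale)


# A score of 2 corresponds to a log-odds ratio of 0.2, i.e. about 1.6 times random.
print(round(odds_from_score(2), 2))
# A pair observed 1.6 times more often than expected scores 2.
print(pam_score(1.6, 1.0))
```

Because the scores are logarithms, summing them along an alignment is equivalent to multiplying the underlying odds ratios, which is why a total alignment score is meaningful.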
115
PAM250
  • Calculated for families of related proteins (>85%
    identity)
  • 1 PAM is the amount of evolutionary change that
    yields, on average, one substitution in 100 amino
    acid residues
  • A positive score signifies a common replacement
    whereas a negative score signifies an unlikely
    replacement
  • PAM250 matrix assumes/is optimized for sequences
    separated by 250 PAM, i.e. 250 substitutions in
    100 amino acids (longer evolutionary time)

116
PAM250
Sequence alignment matrix that allows 250
accepted point mutations per 100 amino acids.
PAM250 is suitable for comparing distantly
related sequences, while a lower PAM is suitable
for comparing more closely related sequences.
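In Dayhoff's scheme, the mutation probability matrix for n PAMs is obtained by multiplying the 1-PAM matrix by itself n times. The sketch below illustrates this with a made-up two-letter alphabet; the matrix `toy_pam1` is not real data, and the helper names are hypothetical. It also shows why 250 accepted mutations per 100 residues does not mean that every residue differs: repeated substitutions at the same site can restore the original state.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    size = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
        for i in range(size)
    ]


def mat_power(matrix, n):
    """Raise a square matrix to the n-th power by repeated multiplication."""
    size = len(matrix)
    result = [[float(i == j) for j in range(size)] for i in range(size)]
    for _ in range(n):
        result = mat_mul(result, matrix)
    return result


# Toy 1-PAM matrix (assumption): each residue has a 1% chance of being replaced.
toy_pam1 = [
    [0.99, 0.01],
    [0.01, 0.99],
]
toy_pam250 = mat_power(toy_pam1, 250)

# Probability that a residue is in its original state after 250 PAMs.
print(round(toy_pam250[0][0], 4))
```

With only two states the retained fraction approaches one half; with 20 amino acids of unequal frequency the corresponding figure for real PAM250 is much lower.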
117
Selecting a PAM Matrix
  • Low PAM numbers: short sequences, strong local
    similarities.
  • High PAM numbers: long sequences, weak
    similarities.
  • PAM60 for close relations (60% identity)
  • PAM120 recommended for general use (40% identity)
  • PAM250 for distant relations (20% identity)
  • If uncertain, try several different matrices
  • PAM40, PAM120, PAM250 recommended.

118
BLOSUM
  • Blocks Substitution Matrix
  • Steven and Jorja G. Henikoff (1992).
  • Based on BLOCKS database (www.blocks.fhcrc.org)
  • Families of proteins with identical function.
  • Highly conserved protein domains.
  • Ungapped local alignment to identify motifs
  • Each motif is a block of local alignment.
  • Counts amino acids observed in same column.
  • Symmetrical model of substitution.


119
BLOSUM62
  • BLOSUM matrices are based on local alignments
    (blocks or conserved amino acid patterns).
  • BLOSUM 62 is a matrix calculated from comparisons
    of sequences that are no more than 62% identical.
  • All BLOSUM matrices are based on observed
    alignments; they are not extrapolated from
    comparisons of closely related proteins.
  • BLOSUM 62 is the default matrix in BLAST 2.0.

120
BLOSUM Matrices
  • Different BLOSUMn matrices are calculated
    independently from BLOCKS
  • BLOSUMn is based on sequences that are at most n
    percent identical.

121
BLOSUM62
The procedure for calculating a BLOSUM matrix is
based on a likelihood method estimating the
occurrence of each possible pairwise
substitution. Only aligned blocks are used to
calculate the BLOSUMs. The higher the score, the
more closely related the sequences.
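A simplified version of the pair-counting calculation can be sketched as follows. This is a toy illustration under several assumptions: the four-sequence block is invented, the step that merges sequences above an identity threshold is omitted, and the scores are scaled as 2 * log2(observed / expected) and rounded, which is the half-bit convention of the published BLOSUM matrices. The function name `blosum_like_scores` is hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def blosum_like_scores(block):
    """Log-odds scores from residue pairs observed in the columns of an ungapped block."""
    # Count every unordered residue pair within each column.
    pair_counts = Counter()
    for column in zip(*block):
        for x, y in combinations(column, 2):
            pair_counts[tuple(sorted((x, y)))] += 1

    total = sum(pair_counts.values())
    observed = {pair: count / total for pair, count in pair_counts.items()}

    # Background frequency of each residue, derived from the pair frequencies.
    background = Counter()
    for (x, y), freq in observed.items():
        if x == y:
            background[x] += freq
        else:
            background[x] += freq / 2
            background[y] += freq / 2

    scores = {}
    for (x, y), freq in observed.items():
        # Expected frequency of the unordered pair if residues paired at random.
        expected = background[x] ** 2 if x == y else 2 * background[x] * background[y]
        scores[(x, y)] = round(2 * log2(freq / expected))
    return scores


# Made-up block: four sequences, two columns.
toy_block = ["AS", "AS", "AS", "SS"]
print(blosum_like_scores(toy_block))
```

Pairs seen more often than chance predicts receive positive scores, and pairs seen less often receive negative scores, which mirrors the pattern of rewards and penalties described for BLOSUM62.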
122
Why is BLOSUM62 called BLOSUM62?
Because all blocks whose members shared at least
62% identity with ANY other member of that block
were averaged and represented as 1 sequence.
123
Selecting a BLOSUM Matrix
  • For BLOSUMn, higher n suitable for sequences
    which are more similar
  • BLOSUM62 recommended for general use
  • BLOSUM80 for close relations
  • BLOSUM45 for distant relations

124
  • Equivalent PAM and Blosum matrices. The following
    matrices are roughly equivalent...
  • PAM100 ≈ Blosum90
  • PAM120 ≈ Blosum80
  • PAM160 ≈ Blosum60
  • PAM200 ≈ Blosum52
  • PAM250 ≈ Blosum45
  • Generally speaking...
  • The Blosum matrices are best for detecting local
    alignments.
  • The Blosum62 matrix is the best for detecting the
    majority of weak protein similarities.
  • The Blosum45 matrix is the best for detecting
    long and weak alignments.

Less divergent
More divergent
125
Comparison of PAM250 and BLOSUM62
The relationship between BLOSUM and PAM
substitution matrices: BLOSUM matrices with
higher numbers and PAM matrices with low numbers
are both designed for comparisons of closely
related sequences. BLOSUM matrices with low
numbers and PAM matrices with high numbers are
designed for comparisons of distantly related
proteins. If distant relatives of the query
sequence are specifically being sought, the
matrix can be tailored to that type of search.
126
Scoring matrices commonly used
  • PAM250
  • Shown to be appropriate for searching for
    sequences of 17-27% identity.
  • BLOSUM62
  • Though it is tailored for comparisons of
    moderately distant proteins, it performs well in
    detecting closer relationships.
  • BLOSUM50
  • Shown to be better for FASTA searches.

127
Effect of gap penalties on amino-acid alignment:
human pancreatic hormone precursor versus
chicken pancreatic hormone.
(a) Penalty for gaps is 0.
(b) Penalty for a gap of size k nucleotides is
w_k = 1 + 0.1k.
(c) The same alignment as in (b), only the similarity
between the two sequences is further enhanced by showing
pairs of biochemically similar amino acids.
128
Alignments things to keep in mind
  • Optimal alignment means having the highest
    possible score, given a substitution matrix and a
    set of gap penalties
  • This is NOT necessarily the most meaningful
    alignment
  • The assumptions of the algorithm are often wrong
  • - substitutions are not equally frequent at all
    positions,
  • - it is very difficult to realistically model
    insertions and deletions.
  • Pairwise alignment programs ALWAYS produce an
    alignment (even when it does not make sense to
    align sequences)

129
(No Transcript)